Research on Two Typical Speech Processing Problems Based on Deep Learning

Master's Thesis, Graduate School of National University of Defense Technology

ABSTRACT

Deep learning is one of the most advanced research fields in artificial intelligence and has made remarkable progress in computer vision, speech processing, robot control, and bioinformatics. It analyzes and learns in a way that loosely simulates the human brain, building complex concepts by abstracting and combining simpler ones. Compared with conventional machine learning algorithms, deep learning learns features from the data instead of relying on hand-crafted ones.

This thesis studies two typical applications of deep learning in speech processing: audio matching and audio-visual speech recognition. From an engineering viewpoint, both are key speech processing technologies that are widely used in speech retrieval and intelligence analysis. From a theoretical viewpoint, audio matching is a typical unsupervised problem and speech recognition is a typical supervised problem, so research on deep learning models for both is of considerable academic value. The major contributions are as follows.

First, to improve the generalization capability of traditional audio matching methods, this thesis proposes extracting audio features with Convolutional Deep Belief Networks (CDBNs). CDBNs combine the ability of Convolutional Neural Networks (CNNs) to handle high-dimensional data with the unsupervised learning of Deep Belief Networks (DBNs), and can therefore extract features with strong generalization capability from high-dimensional audio data in an unsupervised way. Based on the binary features extracted by the CDBN, we propose a faster audio feature matching algorithm. Experimental results show that the CDBN-based audio matching algorithm significantly improves the hit rate of audio matching compared with the traditional algorithm based on Chroma Energy Normalized Statistics (CENS) features.
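The abstract does not describe the matching algorithm itself. As a rough illustration of why binary features make matching fast, the following sketch compares binary frames by Hamming distance while sliding a query over a database sequence. The Hamming criterion, the function names `hamming` and `best_match`, and the toy 8-bit frames are all assumptions made for illustration, not the method of the thesis.

```python
def hamming(a: int, b: int) -> int:
    """Count the differing bits between two binary frames packed as ints."""
    return bin(a ^ b).count("1")


def best_match(query: list[int], database: list[int]) -> tuple[int, int]:
    """Slide the query over the database and return (offset, distance)
    for the alignment with the smallest total Hamming distance."""
    if not query or len(query) > len(database):
        raise ValueError("query must be non-empty and no longer than the database")
    best_offset, best_dist = 0, None
    for offset in range(len(database) - len(query) + 1):
        # zip stops at the end of the query, so only aligned frames are compared.
        dist = sum(hamming(q, d) for q, d in zip(query, database[offset:]))
        if best_dist is None or dist < best_dist:
            best_offset, best_dist = offset, dist
    return best_offset, best_dist


# Toy data (hypothetical): each int is one frame's 8-bit binary feature vector.
database = [0b10110010, 0b01101100, 0b11110000, 0b00001111, 0b10101010]
# The query copies frames 2 and 3 of the database, with one bit flipped.
query = [0b11110001, 0b00001111]

print(best_match(query, database))  # (2, 1)
```

Because each comparison reduces to an XOR and a bit count, binary features avoid the floating-point distance computations that real-valued features such as CENS require.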

Second, to exploit the temporal characteristics of both audio and video information, this thesis proposes a multimodal Recurrent Neural Network (RNN) framework for multimodal speech recognition. The framework consists of an auditory part that processes audio data, a visual part that processes video data, and a fusion part that combines the outputs of the two. Experimental results demonstrate that the proposed speech recognition system based on the multimodal RNN successfully combines video and audio features and improves recognition accuracy over speech recognition based on audio data only, especially on the low signal-to-noise ratio (SNR) dataset.
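The three-part structure described above can be sketched as a forward pass in plain Python. The abstract does not specify the recurrent unit, the fusion mechanism, or any dimensions, so this sketch assumes plain tanh recurrent units, fusion by concatenating the final hidden states, and untrained random weights. All names and sizes are hypothetical.

```python
import math
import random


def rand_matrix(rows, cols, rng):
    """Small random weight matrix (untrained, for illustration only)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    """Matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def rnn_encode(frames, w_in, w_rec):
    """Run a plain tanh RNN over a sequence and return the final hidden state."""
    h = [0.0] * len(w_rec)
    for x in frames:
        pre = [a + b for a, b in zip(matvec(w_in, x), matvec(w_rec, h))]
        h = [math.tanh(p) for p in pre]
    return h


def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def fuse_and_classify(audio, video, p):
    h_a = rnn_encode(audio, p["a_in"], p["a_rec"])  # auditory part
    h_v = rnn_encode(video, p["v_in"], p["v_rec"])  # visual part
    # Fusion part (assumed): concatenate both states, then classify.
    return softmax(matvec(p["out"], h_a + h_v))


rng = random.Random(0)
AUDIO_DIM, VIDEO_DIM, HIDDEN, CLASSES = 4, 3, 5, 3  # hypothetical sizes
params = {
    "a_in": rand_matrix(HIDDEN, AUDIO_DIM, rng),
    "a_rec": rand_matrix(HIDDEN, HIDDEN, rng),
    "v_in": rand_matrix(HIDDEN, VIDEO_DIM, rng),
    "v_rec": rand_matrix(HIDDEN, HIDDEN, rng),
    "out": rand_matrix(CLASSES, 2 * HIDDEN, rng),
}
# The two streams may have different frame rates: 6 audio frames, 3 video frames.
audio = [[rng.random() for _ in range(AUDIO_DIM)] for _ in range(6)]
video = [[rng.random() for _ in range(VIDEO_DIM)] for _ in range(3)]

probs = fuse_and_classify(audio, video, params)
print(probs)
```

Encoding each stream separately before fusion lets the two modalities run at their own frame rates, which is one reason a late-fusion design of this kind is convenient. A trained system would likely replace the plain recurrent units with gated units such as LSTM.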

Key Words: deep learning, speech processing, audio matching, audio-visual speech recognition

List of Abbreviations

CDBN    Convolutional Deep Belief Network
CENS    Chroma Energy Normalized Statistics
RNN     Recurrent Neural Network
DNN     Deep Neural Network
SGD     Stochastic Gradient Descent
ReLU    Rectified Linear Unit
BP      Backpropagation
LSTM    Long Short-Term Memory
GMM     Gaussian Mixture Model
HMM     Hidden Markov Model
DBN     Deep Belief Network
CNN     Convolutional Neural Network
RBM     Restricted Boltzmann Machine
CRBM    Convolutional Restricted Boltzmann Machine
AVSR    Audio-Visual Speech Recognition

Source: https://www.bwwdw.com/article/3cve.html