Audio-visual speech recognition using deep learning

Cited: 390
Authors
Noda, Kuniaki [1 ]
Yamaguchi, Yuki [2 ]
Nakadai, Kazuhiro [3 ]
Okuno, Hiroshi G. [2 ]
Ogata, Tetsuya [1 ]
Affiliations
[1] Waseda Univ, Grad Sch Fundamental Sci & Engn, Tokyo 1698555, Japan
[2] Kyoto Univ, Grad Sch Informat, Kyoto 6068501, Japan
[3] Honda Res Inst Japan Co Ltd, Saitama 3510114, Japan
Keywords
Audio-visual speech recognition; Feature extraction; Deep learning; Multi-stream HMM; Neural networks; Extraction; Features
DOI
10.1007/s10489-014-0629-7
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
An audio-visual speech recognition (AVSR) system is considered one of the most promising solutions for reliable speech recognition, particularly when the audio is corrupted by noise. However, careful selection of sensory features is crucial for attaining high recognition performance. In the machine-learning community, deep learning approaches have recently attracted increasing attention because deep neural networks can effectively extract robust latent features that enable various recognition algorithms to demonstrate revolutionary generalization capabilities under diverse application conditions. This study introduces a connectionist-hidden Markov model (HMM) system for noise-robust AVSR. First, a deep denoising autoencoder is used to acquire noise-robust audio features. By preparing the training data for the network as pairs of multiple consecutive frames of noise-corrupted audio features and the corresponding clean features, the network is trained to output denoised audio features from noise-corrupted inputs. Second, a convolutional neural network (CNN) is used to extract visual features from raw mouth-area images. By preparing the training data for the CNN as pairs of raw images and the corresponding phoneme labels, the network is trained to predict phoneme labels from mouth-area input images. Finally, a multi-stream HMM (MSHMM) is applied to integrate the audio and visual HMMs trained independently on the respective features. Comparing conventional and denoised mel-frequency cepstral coefficients (MFCCs) as audio features for the HMM, our unimodal isolated word recognition results demonstrate that a word recognition rate gain of approximately 65% is attained with denoised MFCCs at a 10 dB signal-to-noise ratio (SNR) for the audio input. Moreover, our multimodal isolated word recognition results using the MSHMM with denoised MFCCs and the acquired visual features demonstrate that a further word recognition rate gain is attained at SNR conditions below 10 dB.
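To make the first stage concrete, the sketch below shows one plausible way to build the noisy/clean training pairs for the denoising autoencoder: a window of consecutive noise-corrupted MFCC frames is mapped to the corresponding clean frames. The 11-frame window, the 13 MFCC coefficients, and the function name are illustrative assumptions, not details taken from the paper.

import numpy as np

def make_denoising_pairs(noisy_mfcc, clean_mfcc, window=11):
    """Build (input, target) pairs for a denoising autoencoder.

    noisy_mfcc, clean_mfcc: time-aligned arrays of shape
    (num_frames, num_coeffs). Each input is a window of consecutive
    noise-corrupted frames flattened into one vector; the target is the
    corresponding window of clean frames, so the network is trained to
    learn a noisy -> clean mapping over multi-frame context.
    """
    num_frames = noisy_mfcc.shape[0]
    inputs, targets = [], []
    for t in range(num_frames - window + 1):
        inputs.append(noisy_mfcc[t:t + window].ravel())
        targets.append(clean_mfcc[t:t + window].ravel())
    return np.asarray(inputs), np.asarray(targets)

# Toy usage: 100 frames of 13-dimensional MFCCs plus additive noise.
rng = np.random.default_rng(0)
clean = rng.standard_normal((100, 13))
noisy = clean + 0.3 * rng.standard_normal((100, 13))
X, Y = make_denoising_pairs(noisy, clean)
print(X.shape, Y.shape)  # (90, 143) (90, 143)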
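For the second stage, the following minimal sketch runs a single convolution, max-pooling, and softmax layer over a mouth-region image to produce phoneme posteriors, which is the general pattern of a CNN phoneme classifier. The 32x32 image size, 8 feature maps, 40-label phoneme inventory, and untrained random weights are all assumptions made purely for illustration.

import numpy as np

rng = np.random.default_rng(0)

def conv2d_valid(image, kernels):
    """Naive 'valid' 2D convolution of one grayscale image with K kernels."""
    H, W = image.shape
    K, kh, kw = kernels.shape
    out = np.empty((K, H - kh + 1, W - kw + 1))
    for k in range(K):
        for i in range(H - kh + 1):
            for j in range(W - kw + 1):
                out[k, i, j] = np.sum(image[i:i + kh, j:j + kw] * kernels[k])
    return out

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

image = rng.random((32, 32))                       # hypothetical mouth-area crop
kernels = rng.standard_normal((8, 5, 5)) * 0.1     # 8 learnable 5x5 filters
feature_maps = np.maximum(conv2d_valid(image, kernels), 0.0)   # ReLU, (8, 28, 28)
pooled = feature_maps.reshape(8, 14, 2, 14, 2).max(axis=(2, 4))  # 2x2 max pooling
W_out = rng.standard_normal((40, pooled.size)) * 0.01  # output layer: 40 phoneme labels
phoneme_probs = softmax(W_out @ pooled.ravel())
print(phoneme_probs.argmax(), phoneme_probs.max())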
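For the final stage, the sketch below shows the usual MSHMM fusion rule: per-state audio and visual log-likelihoods are combined with stream weights that sum to one. The toy scores and the specific weight values are assumptions; only the weighted log-likelihood combination itself is the standard multi-stream formulation.

import numpy as np

def mshmm_log_likelihood(log_b_audio, log_b_visual, lambda_audio):
    # Weighted combination of per-state stream log-likelihoods;
    # the visual weight is 1 - lambda_audio so the weights sum to one.
    return lambda_audio * log_b_audio + (1.0 - lambda_audio) * log_b_visual

# Toy per-state observation scores for one frame (3 HMM states).
log_b_audio = np.log(np.array([0.10, 0.60, 0.30]))
log_b_visual = np.log(np.array([0.70, 0.10, 0.20]))

# Sweeping the audio stream weight: at low SNR a smaller lambda_audio
# lets the more reliable visual stream dominate the state decision.
for lam in (0.9, 0.5, 0.1):
    combined = mshmm_log_likelihood(log_b_audio, log_b_visual, lam)
    print(f"lambda_audio={lam:.1f} -> best state {int(combined.argmax())}")

Shifting weight toward the visual stream as the SNR drops is what allows the multimodal system to retain its recognition advantage below 10 dB.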
Pages: 722-737 (16 pages)