Audio-visual feature fusion via deep neural networks for automatic speech recognition

Cited by: 20
Authors
Rahmani, Mohammad Hasan [1 ]
Almasganj, Farshad [1 ]
Seyyedsalehi, Seyyed Ali [1 ]
Affiliations
[1] Amirkabir Univ Technol, Biomed Engn Dept, Hafez Ave, Tehran, Iran
Keywords
Audio-visual speech recognition; Deep autoencoder; Deep neural networks; Feature extraction; Multimodal information processing; Noise
DOI
10.1016/j.dsp.2018.06.004
CLC Number
TM (Electrical Technology); TN (Electronic Technology, Communication Technology)
Subject Classification Number
0808 ; 0809 ;
Abstract
The brain-like functionality of artificial neural networks, together with their strong performance across many areas of scientific application, makes them a reliable tool for Audio-Visual Speech Recognition (AVSR) systems. In AVSR systems, such networks are applied from the preliminary stage of feature extraction up to the higher levels of information combination and speech modeling. In this paper, carefully designed deep autoencoders are proposed to produce efficient bimodal features from the audio and visual input streams. The basic proposed structure is modified in three successive steps to make better use of the visual information available from the speakers' lip Region of Interest (ROI). The performance of the proposed structures is compared to both unimodal and bimodal baselines in a phoneme recognition task under different noisy audio conditions, using a state-of-the-art DNN-HMM hybrid as the speech classifier. Compared to MFCC audio-only features, the final bimodal features yield an average relative reduction in Phoneme Error Rate (PER) of 36.9% over a range of noisy conditions, and of 19.2% in the clean condition. (C) 2018 Elsevier Inc. All rights reserved.
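As a rough illustration of the bimodal deep-autoencoder idea the abstract describes, the sketch below fuses per-frame audio features (e.g. MFCCs) and visual lip-ROI features through modality-specific encoders into a shared bottleneck, whose activations would serve as the fused bimodal feature fed to a downstream classifier. All layer sizes, weight initializations, and names here are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

class BimodalAutoencoder:
    """Toy bimodal autoencoder: separate audio/visual encoders feed a
    shared bottleneck; the bottleneck activations are the fused feature.
    Dimensions are illustrative only (hypothetical, not from the paper)."""

    def __init__(self, d_audio=39, d_visual=32, d_hidden=64, d_bottleneck=40):
        s = 0.1  # small random init for this untrained sketch
        self.W_a = rng.normal(0, s, (d_audio, d_hidden))        # audio encoder
        self.W_v = rng.normal(0, s, (d_visual, d_hidden))       # visual encoder
        self.W_b = rng.normal(0, s, (2 * d_hidden, d_bottleneck))  # shared layer
        self.W_da = rng.normal(0, s, (d_bottleneck, d_audio))   # audio decoder
        self.W_dv = rng.normal(0, s, (d_bottleneck, d_visual))  # visual decoder

    def encode(self, audio, visual):
        # Encode each modality separately, then fuse in a shared bottleneck.
        h = np.concatenate([relu(audio @ self.W_a),
                            relu(visual @ self.W_v)], axis=1)
        return relu(h @ self.W_b)  # fused bimodal feature per frame

    def reconstruct(self, audio, visual):
        # Decode both modalities from the single fused representation.
        z = self.encode(audio, visual)
        return z @ self.W_da, z @ self.W_dv

ae = BimodalAutoencoder()
audio = rng.normal(size=(5, 39))   # 5 frames of 39-dim MFCC-like features
visual = rng.normal(size=(5, 32))  # 5 frames of lip-ROI features
z = ae.encode(audio, visual)
print(z.shape)  # → (5, 40)
```

In a real system the weights would be trained to minimize reconstruction error over both modalities (e.g. by layer-wise pretraining plus fine-tuning), and the bottleneck activations would replace or augment the MFCCs at the input of the DNN-HMM classifier.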
Pages: 54-63 (10 pages)