Audio-visual feature fusion via deep neural networks for automatic speech recognition

Cited by: 20
Authors
Rahmani, Mohammad Hasan [1 ]
Almasganj, Farshad [1 ]
Seyyedsalehi, Seyyed Ali [1 ]
Affiliations
[1] Amirkabir Univ Technol, Biomed Engn Dept, Hafez Ave, Tehran, Iran
Keywords
Audio-visual speech recognition; Deep autoencoder; Deep neural networks; Feature extraction; Multimodal information processing; Noise
DOI
10.1016/j.dsp.2018.06.004
CLC Number
TM (Electrical Technology); TN (Electronic Technology, Communication Technology)
Subject Classification Number
0808 ; 0809 ;
Abstract
The brain-like functionality of artificial neural networks, together with their strong performance across many areas of scientific application, makes them a reliable tool for Audio-Visual Speech Recognition (AVSR) systems. In AVSR systems, such networks are applied from the preliminary stage of feature extraction up to the higher levels of information combination and speech modeling. In this paper, carefully designed deep autoencoders are proposed to produce efficient bimodal features from the audio and visual input streams. The basic proposed structure is modified in three successive steps to make better use of the visual information available from the speakers' lip Region of Interest (ROI). The performance of the proposed structures is compared to both unimodal and bimodal baselines in a phoneme recognition task under different noisy audio conditions, using a state-of-the-art DNN-HMM hybrid as the speech classifier. Compared to MFCC audio-only features, the final bimodal features yield an average relative reduction in Phoneme Error Rate (PER) of 36.9% over a range of noisy conditions, and of 19.2% in the clean condition. (C) 2018 Elsevier Inc. All rights reserved.
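As a rough illustration of the bimodal deep-autoencoder idea the abstract describes, the sketch below fuses per-frame audio features (e.g. MFCCs) and visual lip-ROI features through modality-specific encoders into a shared bottleneck, whose activations would serve as the fused bimodal feature fed to a downstream classifier. All layer sizes, weight initializations, and names here are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

class BimodalAutoencoder:
    """Toy bimodal autoencoder: separate audio/visual encoders feed a
    shared bottleneck; the bottleneck activations are the fused feature.
    Dimensions are illustrative only (hypothetical, not from the paper)."""

    def __init__(self, d_audio=39, d_visual=32, d_hidden=64, d_bottleneck=40):
        s = 0.1  # small random init for this untrained sketch
        self.W_a = rng.normal(0, s, (d_audio, d_hidden))        # audio encoder
        self.W_v = rng.normal(0, s, (d_visual, d_hidden))       # visual encoder
        self.W_b = rng.normal(0, s, (2 * d_hidden, d_bottleneck))  # shared layer
        self.W_da = rng.normal(0, s, (d_bottleneck, d_audio))   # audio decoder
        self.W_dv = rng.normal(0, s, (d_bottleneck, d_visual))  # visual decoder

    def encode(self, audio, visual):
        # Encode each modality separately, then fuse in a shared bottleneck.
        h = np.concatenate([relu(audio @ self.W_a),
                            relu(visual @ self.W_v)], axis=1)
        return relu(h @ self.W_b)  # fused bimodal feature per frame

    def reconstruct(self, audio, visual):
        # Decode both modalities from the single fused representation.
        z = self.encode(audio, visual)
        return z @ self.W_da, z @ self.W_dv

ae = BimodalAutoencoder()
audio = rng.normal(size=(5, 39))   # 5 frames of 39-dim MFCC-like features
visual = rng.normal(size=(5, 32))  # 5 frames of lip-ROI features
z = ae.encode(audio, visual)
print(z.shape)  # → (5, 40)
```

In a real system the weights would be trained to minimize reconstruction error over both modalities (e.g. by layer-wise pretraining plus fine-tuning), and the bottleneck activations would replace or augment the MFCCs at the input of the DNN-HMM classifier.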
Pages: 54-63 (10 pages)