Visual model structures and synchrony constraints for audio-visual speech recognition

Cited by: 36
Author
Hazen, TJ [1]
Affiliation
[1] MIT, Artificial Intelligence Lab, Cambridge, MA 02139 USA
Source
IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2006 / Vol. 14 / No. 03
Keywords
audio-visual speech recognition; lip-reading; multimodal speech processing;
DOI
10.1109/TSA.2005.857572
Chinese Library Classification (CLC) number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
This paper presents the design and evaluation of a speaker-independent audio-visual speech recognition (AVSR) system that utilizes a segment-based modeling strategy. The audio and visual feature streams are integrated using a segment-constrained hidden Markov model, which allows the visual classifier to process visual frames with a constrained amount of asynchrony relative to proposed acoustic segments. The core experiments in this paper investigate several different visual model structures, each of which provides a different means for defining the units of the visual classifier and the synchrony constraints between the audio and visual streams. Word recognition experiments are conducted on the AV-TIMIT corpus under variable additive noise conditions. Over varying acoustic signal-to-noise ratios, word error rate reductions between 14% and 60% are observed when integrating the visual information into the automatic speech recognition process.
Pages: 1082-1089
Number of pages: 8