Visual model structures and synchrony constraints for audio-visual speech recognition

Cited by: 36

Authors:

Hazen, TJ [1]

Affiliations:

[1] MIT, Artificial Intelligence Lab, Cambridge, MA 02139 USA

Source:

IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2006 / Vol. 14 / Issue 03

Keywords:

audio-visual speech recognition; lip-reading; multimodal speech processing

DOI:

10.1109/TSA.2005.857572

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Discipline classification codes:

070206; 082403

Abstract:

This paper presents the design and evaluation of a speaker-independent audio-visual speech recognition (AVSR) system that utilizes a segment-based modeling strategy. The audio and visual feature streams are integrated using a segment-constrained hidden Markov model, which allows the visual classifier to process visual frames with a constrained amount of asynchrony relative to proposed acoustic segments. The core experiments in this paper investigate several different visual model structures, each of which provides a different means for defining the units of the visual classifier and the synchrony constraints between the audio and visual streams. Word recognition experiments are conducted on the AV-TIMIT corpus under variable additive noise conditions. Over varying acoustic signal-to-noise ratios, word error rate reductions between 14% and 60% are observed when integrating the visual information into the automatic speech recognition process.


Pages: 1082-1089

Page count: 8
