Comparing Fusion Models for DNN-Based Audiovisual Continuous Speech Recognition

Cited by: 33
Authors
Abdelaziz, Ahmed Hussen [1]
Affiliations
[1] Int Comp Sci Inst, Berkeley, CA 94704 USA
Keywords
Audiovisual speech recognition; audiovisual fusion; automatic lipreading; multistream hidden Markov model (HMM); coupled HMM; turbo decoders; audiovisual automatic speech recognition (AV-ASR) benchmarks; noise; algorithm
DOI
10.1109/TASLP.2017.2783545
Chinese Library Classification (CLC) number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
Audiovisual fusion is one of the most challenging tasks, and it continues to attract substantial research interest in the field of audiovisual automatic speech recognition (AV-ASR). Over the last few decades, many approaches to integrating the audio and video modalities have been proposed to enhance the performance of automatic speech recognition in both clean and noisy conditions. However, very few studies in the literature compare different fusion models for AV-ASR, and even fewer compare audiovisual fusion models for large vocabulary continuous speech recognition (LVCSR) using deep neural networks (DNNs). This paper reviews and compares the performance of five audiovisual fusion models: the feature fusion model, the decision fusion model, the multistream hidden Markov model (HMM), the coupled HMM, and the turbo decoder. A complete evaluation of these fusion models is conducted using a standard speaker-independent DNN-based LVCSR Kaldi recipe in three experimental setups: clean-train-clean-test, clean-train-noisy-test, and matched training. All experiments are applied to the recently released NTCD-TIMIT audiovisual corpus, whose task is phone recognition in continuous speech. Using NTCD-TIMIT with its freely available visual features and 37 clean and noisy acoustic signals allows this study to serve as a common benchmark against which novel LVCSR AV-ASR models and approaches can be compared.
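The two simplest models contrasted in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: feature fusion concatenates per-frame audio and visual feature vectors before a single acoustic model, while decision/multistream-style fusion combines per-stream log-likelihoods with stream weights (the function names, dimensions, and the example weight `lam=0.7` are illustrative assumptions).

```python
import numpy as np

def feature_fusion(audio_feats, visual_feats):
    """Early (feature) fusion: concatenate audio and visual features
    frame by frame. audio_feats: (T, Da), visual_feats: (T, Dv)
    -> fused features of shape (T, Da + Dv)."""
    return np.concatenate([audio_feats, visual_feats], axis=1)

def weighted_loglik_fusion(loglik_audio, loglik_visual, lam=0.7):
    """Late (decision/multistream-style) fusion: combine per-stream
    log-likelihoods with stream weights lam and (1 - lam), as in a
    multistream HMM. lam would typically be tuned to the noise level."""
    return lam * loglik_audio + (1.0 - lam) * loglik_visual

# Toy example with illustrative dimensions (39 audio, 30 visual dims).
T, Da, Dv = 100, 39, 30
a = np.random.randn(T, Da)
v = np.random.randn(T, Dv)
fused = feature_fusion(a, v)
print(fused.shape)  # (100, 69)
```

The remaining three models (coupled HMM, multistream HMM with state-level synchrony, turbo decoding) differ in *where* and *how often* the streams exchange information during decoding, which this sketch does not capture.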
Pages: 475-484 (10 pages)