Comparing Fusion Models for DNN-Based Audiovisual Continuous Speech Recognition

Cited by: 33
Authors
Abdelaziz, Ahmed Hussen [1 ]
Affiliation
[1] Int Comp Sci Inst, Berkeley, CA 94704 USA
Keywords
Audiovisual speech recognition; audiovisual fusion; automatic lipreading; multistream hidden Markov model (HMM); coupled HMM; turbo decoders; audiovisual automatic speech recognition (AV-ASR) benchmarks; NOISE; ALGORITHM;
DOI
10.1109/TASLP.2017.2783545
Chinese Library Classification: O42 [Acoustics]
Discipline codes: 070206; 082403
Abstract
Audiovisual fusion is one of the most challenging tasks that continue to attract substantial research interest in the field of audiovisual automatic speech recognition (AV-ASR). Over the last few decades, many approaches to integrating the audio and video modalities have been proposed to enhance the performance of automatic speech recognition in both clean and noisy conditions. However, very few studies in the literature compare different fusion models for AV-ASR, and even fewer compare audiovisual fusion models for large vocabulary continuous speech recognition (LVCSR) using deep neural networks (DNNs). This paper reviews and compares the performance of five audiovisual fusion models: the feature fusion model, the decision fusion model, the multistream hidden Markov model (HMM), the coupled HMM, and the turbo decoder. A complete evaluation of these fusion models is conducted using a standard speaker-independent DNN-based LVCSR Kaldi recipe in three experimental setups: clean-train-clean-test, clean-train-noisy-test, and matched training. All experiments use the recently released NTCD-TIMIT audiovisual corpus, whose task is phone recognition in continuous speech. Using NTCD-TIMIT with its freely available visual features and 37 clean and noisy acoustic signals allows this study to serve as a common benchmark against which novel LVCSR AV-ASR models and approaches can be compared.
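As an illustrative sketch not taken from the paper itself: in feature (early) fusion, synchronized audio and visual feature vectors are concatenated per frame before being fed to a single DNN, while in decision (late) fusion the streams' class log-likelihoods are combined with a reliability-weighted sum. All dimensions, class counts, and weights below are hypothetical, not the values used in the paper.

```python
import numpy as np

# Hypothetical frame count and per-frame feature dimensions; the actual
# NTCD-TIMIT acoustic and visual feature sizes may differ.
num_frames, audio_dim, video_dim = 100, 40, 30
rng = np.random.default_rng(0)
audio_feats = rng.standard_normal((num_frames, audio_dim))
video_feats = rng.standard_normal((num_frames, video_dim))

# Feature (early) fusion: concatenate the synchronized streams frame by
# frame into one joint vector that a single DNN acoustic model consumes.
fused = np.concatenate([audio_feats, video_feats], axis=1)

# Decision (late) fusion: combine per-stream log-likelihoods over a set of
# classes with a stream-weighted sum; lam is an assumed reliability weight
# favoring the audio stream in clean conditions.
num_classes = 5
audio_loglik = rng.standard_normal((num_frames, num_classes))
video_loglik = rng.standard_normal((num_frames, num_classes))
lam = 0.7
fused_loglik = lam * audio_loglik + (1.0 - lam) * video_loglik

print(fused.shape)        # (100, 70)
print(fused_loglik.shape) # (100, 5)
```

The multistream HMM, coupled HMM, and turbo-decoding models compared in the paper generalize this weighted combination to operate inside the decoding process rather than on fixed per-frame scores.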
Pages: 475–484
Page count: 10