Turbo Decoders for Audio-visual Continuous Speech Recognition

Cited by: 5

Authors:

Abdelaziz, Ahmed Hussen [1]

Affiliations:

[1] Int Comp Sci Inst, Berkeley, CA 94704 USA

Source:

18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION | 2017

Keywords:

Turbo decoding; audio-visual speech recognition; audio-visual fusion; noise-robustness; ASR;

DOI:

10.21437/Interspeech.2017-799

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Visual speech, i.e., video recordings of speakers' mouths, plays an important role in improving the robustness of automatic speech recognition (ASR) against noise. Optimal fusion of the audio and video modalities is still one of the major challenges that attracts significant interest in the realm of audio-visual ASR. Recently, turbo decoders (TDs) have been successful in addressing the audio-visual fusion problem. The idea of the TD framework is to iteratively exchange soft information between the audio and video decoders until convergence. The forward-backward algorithm (FBA) is mostly applied to the decoding graphs to estimate this soft information. Applying the FBA to the complex graphs that are usually used in large vocabulary tasks may be computationally expensive. In this paper, I propose to apply the forward-backward algorithm to a lattice of most likely state sequences instead of using the entire decoding graph. Using lattices allows TDs to be easily applied to large vocabulary tasks. The proposed approach is evaluated using the newly released TCD-TIMIT corpus, where a standard recipe for large vocabulary ASR is employed. The modified TD performs significantly better than the feature and decision fusion models in all clean and noisy test conditions.
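As an illustration of the core idea in the abstract, running the forward-backward algorithm over a lattice rather than a full decoding graph, the following is a minimal sketch and not the paper's implementation. It assumes a hypothetical lattice representation: a list of arcs `(src, dst, log_score)` whose integer node ids are already in topological order. The function name `arc_posteriors` and the toy lattice are assumptions made for this example only. The resulting arc posteriors are one possible form of the soft information that audio and video decoders could exchange.

```python
import math
from collections import defaultdict


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def arc_posteriors(arcs, start, final):
    """Forward-backward over a lattice (a DAG).

    arcs: list of (src, dst, log_score); node ids are assumed to be
    topologically ordered (src < dst), and `final` reachable from `start`.
    Returns the posterior probability of each arc, in input order.
    """
    nodes = sorted({n for s, d, _ in arcs for n in (s, d)})
    incoming, outgoing = defaultdict(list), defaultdict(list)
    for i, (s, d, _) in enumerate(arcs):
        outgoing[s].append(i)
        incoming[d].append(i)

    alpha = {n: -math.inf for n in nodes}  # log forward scores
    beta = {n: -math.inf for n in nodes}   # log backward scores
    alpha[start] = 0.0
    beta[final] = 0.0

    # Forward pass: sum over all partial paths reaching each node.
    for n in nodes:
        if n != start and incoming[n]:
            alpha[n] = logsumexp(
                [alpha[arcs[i][0]] + arcs[i][2] for i in incoming[n]]
            )

    # Backward pass: sum over all partial paths leaving each node.
    for n in reversed(nodes):
        if n != final and outgoing[n]:
            beta[n] = logsumexp(
                [arcs[i][2] + beta[arcs[i][1]] for i in outgoing[n]]
            )

    total = alpha[final]  # log of the summed score of all complete paths
    return [math.exp(alpha[s] + w + beta[d] - total) for s, d, w in arcs]


# Toy lattice with two competing paths: 0 -> 1 -> 3 and 0 -> 2 -> 3.
toy_arcs = [
    (0, 1, math.log(0.6)),
    (0, 2, math.log(0.4)),
    (1, 3, math.log(0.5)),
    (2, 3, math.log(1.0)),
]
toy_posteriors = arc_posteriors(toy_arcs, start=0, final=3)
```

In this toy lattice the two complete paths have scores 0.3 and 0.4, so the arcs on the first path receive posterior 3/7 and those on the second 4/7. The cost of both passes is linear in the number of lattice arcs, which is why restricting the computation to a pruned lattice can be cheaper than processing an entire decoding graph.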

Pages: 3667-3671

Page count: 5
