Turbo Decoders for Audio-visual Continuous Speech Recognition

Cited: 5
Author
Abdelaziz, Ahmed Hussen [1 ]
Affiliation
[1] Int Comp Sci Inst, Berkeley, CA 94704 USA
Source
18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION | 2017
Keywords
Turbo decoding; audio-visual speech recognition; audio-visual fusion; noise-robustness; ASR;
D O I
10.21437/Interspeech.2017-799
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Discipline Code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Visual speech, i.e., video recordings of speakers' mouths, plays an important role in improving the robustness of automatic speech recognition (ASR) against noise. Optimal fusion of the audio and video modalities remains one of the major challenges and attracts significant interest in the realm of audio-visual ASR. Recently, turbo decoders (TDs) have been successful in addressing the audio-visual fusion problem. The idea of the TD framework is to iteratively exchange soft information between the audio and video decoders until convergence. The forward-backward algorithm (FBA) is usually applied to the decoding graphs to estimate this soft information. Applying the FBA to the complex graphs typically used in large vocabulary tasks can be computationally expensive. In this paper, I propose to apply the forward-backward algorithm to a lattice of the most likely state sequences instead of the entire decoding graph. Using lattices allows the TD to be easily applied to large vocabulary tasks. The proposed approach is evaluated on the newly released TCD-TIMIT corpus, where a standard recipe for large vocabulary ASR is employed. The modified TD performs significantly better than the feature and decision fusion models in all clean and noisy test conditions.
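The soft information exchanged in the TD framework consists of per-frame state posteriors, which the forward-backward algorithm computes from forward and backward probabilities. The following is a minimal, illustrative sketch of the FBA on a tiny HMM; the 2-state model and all numbers are hypothetical toy values, not the paper's decoding graph or lattice.

```python
# Minimal forward-backward algorithm (FBA) sketch on a toy HMM.
# Computes per-frame state posteriors -- the kind of "soft information"
# a turbo decoder exchanges between modalities.
# All model values below are illustrative, not from the paper.

def forward_backward(init, trans, obs_lik):
    """Return per-frame state posteriors gamma[t][s].

    init    -- initial state distribution, init[s]
    trans   -- transition probabilities, trans[s][s2]
    obs_lik -- observation likelihoods per frame, obs_lik[t][s]
    """
    T, S = len(obs_lik), len(init)

    # Forward pass: alpha[t][s] = P(obs[0..t], state_t = s)
    alpha = [[0.0] * S for _ in range(T)]
    for s in range(S):
        alpha[0][s] = init[s] * obs_lik[0][s]
    for t in range(1, T):
        for s in range(S):
            alpha[t][s] = obs_lik[t][s] * sum(
                alpha[t - 1][p] * trans[p][s] for p in range(S))

    # Backward pass: beta[t][s] = P(obs[t+1..T-1] | state_t = s)
    beta = [[1.0] * S for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for s in range(S):
            beta[t][s] = sum(trans[s][n] * obs_lik[t + 1][n] * beta[t + 1][n]
                             for n in range(S))

    # Combine and normalize: gamma[t][s] = P(state_t = s | all observations)
    gamma = []
    for t in range(T):
        unnorm = [alpha[t][s] * beta[t][s] for s in range(S)]
        z = sum(unnorm)
        gamma.append([u / z for u in unnorm])
    return gamma

# Toy usage: 2 states, 3 frames.
posteriors = forward_backward(
    init=[0.6, 0.4],
    trans=[[0.7, 0.3], [0.4, 0.6]],
    obs_lik=[[0.9, 0.2], [0.8, 0.3], [0.1, 0.7]])
print(posteriors[0])  # state posteriors for the first frame
```

The paper's contribution is to run this computation over a pruned lattice of likely state sequences rather than over the full decoding graph, which keeps the per-iteration cost tractable for large vocabulary tasks.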
Pages: 3667-3671
Page count: 5
Related Papers
50 items in total
  • [11] Building a data corpus for audio-visual speech recognition
    Chitu, Alin G.
    Rothkrantz, Leon J. M.
    [J]. EUROMEDIA '2007, 2007, : 88 - 92
  • [12] Audio-Visual Speech Recognition in the Presence of a Competing Speaker
    Shao, Xu
    Barker, Jon
    [J]. INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, 2006, : 1292 - 1295
  • [13] DARE: Deceiving Audio-Visual speech Recognition model
    Mishra, Saumya
    Gupta, Anup Kumar
    Gupta, Puneet
    [J]. KNOWLEDGE-BASED SYSTEMS, 2021, 232
  • [14] Dynamic Bayesian Networks for Audio-Visual Speech Recognition
    Ara V. Nefian
    Luhong Liang
    Xiaobo Pi
    Xiaoxing Liu
    Kevin Murphy
    [J]. EURASIP Journal on Advances in Signal Processing, 2002
  • [15] Connectionism based audio-visual speech recognition method
    Che, Na
    Zhu, Yi-Ming
    Zhao, Jian
    Sun, Lei
    Shi, Li-Juan
    Zeng, Xian-Wei
    [J]. Jilin Daxue Xuebao (Gongxueban)/Journal of Jilin University (Engineering and Technology Edition), 2024, 54 (10): : 2984 - 2993
  • [16] Research on Robust Audio-Visual Speech Recognition Algorithms
    Yang, Wenfeng
    Li, Pengyi
    Yang, Wei
    Liu, Yuxing
    He, Yulong
    Petrosian, Ovanes
    Davydenko, Aleksandr
    [J]. MATHEMATICS, 2023, 11 (07)
  • [17] Dynamic Bayesian networks for audio-visual speech recognition
    Nefian, AV
    Liang, LH
    Pi, XB
    Liu, XX
    Murphy, K
    [J]. EURASIP JOURNAL ON APPLIED SIGNAL PROCESSING, 2002, 2002 (11) : 1274 - 1288
  • [18] On Dynamic Stream Weighting for Audio-Visual Speech Recognition
    Estellers, Virginia
    Gurban, Mihai
    Thiran, Jean-Philippe
    [J]. IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2012, 20 (04): : 1145 - 1157
  • [19] Audio-visual speech recognition using deep learning
    Noda, Kuniaki
    Yamaguchi, Yuki
    Nakadai, Kazuhiro
    Okuno, Hiroshi G.
    Ogata, Tetsuya
    [J]. APPLIED INTELLIGENCE, 2015, 42 (04) : 722 - 737
  • [20] DAVIS: Driver's Audio-Visual Speech Recognition
    Ivanko, Denis
    Ryumin, Dmitry
    Kashevnik, Alexey
    Axyonov, Alexandr
    Kitenko, Andrey
    Lashkov, Igor
    Karpov, Alexey
    [J]. INTERSPEECH 2022, 2022, : 1141 - 1142