Turbo Decoders for Audio-visual Continuous Speech Recognition

Cited: 5
Author
Abdelaziz, Ahmed Hussen [1 ]
Affiliation
[1] Int Comp Sci Inst, Berkeley, CA 94704 USA
Source
18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION | 2017
Keywords
Turbo decoding; audio-visual speech recognition; audio-visual fusion; noise-robustness; ASR;
D O I
10.21437/Interspeech.2017-799
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Discipline Code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Visual speech, i.e., video recordings of speakers' mouths, plays an important role in improving the robustness of automatic speech recognition (ASR) against noise. Optimal fusion of the audio and video modalities remains one of the major challenges and attracts significant interest in the realm of audio-visual ASR. Recently, turbo decoders (TDs) have been successful in addressing the audio-visual fusion problem. The idea of the TD framework is to iteratively exchange soft information between the audio and video decoders until convergence. The forward-backward algorithm (FBA) is usually applied to the decoding graphs to estimate this soft information. Applying the FBA to the complex graphs typically used in large vocabulary tasks can be computationally expensive. In this paper, I propose to apply the forward-backward algorithm to a lattice of the most likely state sequences instead of the entire decoding graph. Using lattices allows the TD to be easily applied to large vocabulary tasks. The proposed approach is evaluated on the newly released TCD-TIMIT corpus, where a standard recipe for large vocabulary ASR is employed. The modified TD performs significantly better than the feature and decision fusion models in all clean and noisy test conditions.
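The soft information exchanged in the TD framework consists of per-frame state posteriors, which the forward-backward algorithm computes from forward and backward probabilities. The following is a minimal, illustrative sketch of the FBA on a tiny HMM; the 2-state model and all numbers are hypothetical toy values, not the paper's decoding graph or lattice.

```python
# Minimal forward-backward algorithm (FBA) sketch on a toy HMM.
# Computes per-frame state posteriors -- the kind of "soft information"
# a turbo decoder exchanges between modalities.
# All model values below are illustrative, not from the paper.

def forward_backward(init, trans, obs_lik):
    """Return per-frame state posteriors gamma[t][s].

    init    -- initial state distribution, init[s]
    trans   -- transition probabilities, trans[s][s2]
    obs_lik -- observation likelihoods per frame, obs_lik[t][s]
    """
    T, S = len(obs_lik), len(init)

    # Forward pass: alpha[t][s] = P(obs[0..t], state_t = s)
    alpha = [[0.0] * S for _ in range(T)]
    for s in range(S):
        alpha[0][s] = init[s] * obs_lik[0][s]
    for t in range(1, T):
        for s in range(S):
            alpha[t][s] = obs_lik[t][s] * sum(
                alpha[t - 1][p] * trans[p][s] for p in range(S))

    # Backward pass: beta[t][s] = P(obs[t+1..T-1] | state_t = s)
    beta = [[1.0] * S for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for s in range(S):
            beta[t][s] = sum(trans[s][n] * obs_lik[t + 1][n] * beta[t + 1][n]
                             for n in range(S))

    # Combine and normalize: gamma[t][s] = P(state_t = s | all observations)
    gamma = []
    for t in range(T):
        unnorm = [alpha[t][s] * beta[t][s] for s in range(S)]
        z = sum(unnorm)
        gamma.append([u / z for u in unnorm])
    return gamma

# Toy usage: 2 states, 3 frames.
posteriors = forward_backward(
    init=[0.6, 0.4],
    trans=[[0.7, 0.3], [0.4, 0.6]],
    obs_lik=[[0.9, 0.2], [0.8, 0.3], [0.1, 0.7]])
print(posteriors[0])  # state posteriors for the first frame
```

The paper's contribution is to run this computation over a pruned lattice of likely state sequences rather than over the full decoding graph, which keeps the per-iteration cost tractable for large vocabulary tasks.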
Pages: 3667-3671
Page count: 5
Related Papers
50 items in total
  • [11] Building a data corpus for audio-visual speech recognition
    Chitu, Alin G.
    Rothkrantz, Leon J. M.
    [J]. EUROMEDIA '2007, 2007, : 88 - 92
  • [12] Audio-Visual Speech Recognition in the Presence of a Competing Speaker
    Shao, Xu
    Barker, Jon
    [J]. INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, 2006, : 1292 - 1295
  • [13] DARE: Deceiving Audio-Visual speech Recognition model
    Mishra, Saumya
    Gupta, Anup Kumar
    Gupta, Puneet
    [J]. KNOWLEDGE-BASED SYSTEMS, 2021, 232
  • [14] Dynamic Bayesian Networks for Audio-Visual Speech Recognition
    Ara V. Nefian
    Luhong Liang
    Xiaobo Pi
    Xiaoxing Liu
    Kevin Murphy
    [J]. EURASIP Journal on Advances in Signal Processing, 2002
  • [15] Connectionism based audio-visual speech recognition method
    Che, Na
    Zhu, Yi-Ming
    Zhao, Jian
    Sun, Lei
    Shi, Li-Juan
    Zeng, Xian-Wei
    [J]. Jilin Daxue Xuebao (Gongxueban)/Journal of Jilin University (Engineering and Technology Edition), 2024, 54 (10): : 2984 - 2993
  • [16] Research on Robust Audio-Visual Speech Recognition Algorithms
    Yang, Wenfeng
    Li, Pengyi
    Yang, Wei
    Liu, Yuxing
    He, Yulong
    Petrosian, Ovanes
    Davydenko, Aleksandr
    [J]. MATHEMATICS, 2023, 11 (07)
  • [17] Dynamic Bayesian networks for audio-visual speech recognition
    Nefian, AV
    Liang, LH
    Pi, XB
    Liu, XX
    Murphy, K
    [J]. EURASIP JOURNAL ON APPLIED SIGNAL PROCESSING, 2002, 2002 (11) : 1274 - 1288
  • [18] On Dynamic Stream Weighting for Audio-Visual Speech Recognition
    Estellers, Virginia
    Gurban, Mihai
    Thiran, Jean-Philippe
    [J]. IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2012, 20 (04): : 1145 - 1157
  • [19] Audio-visual speech recognition using deep learning
    Noda, Kuniaki
    Yamaguchi, Yuki
    Nakadai, Kazuhiro
    Okuno, Hiroshi G.
    Ogata, Tetsuya
    [J]. APPLIED INTELLIGENCE, 2015, 42 (04) : 722 - 737
  • [20] DAVIS: Driver's Audio-Visual Speech Recognition
    Ivanko, Denis
    Ryumin, Dmitry
    Kashevnik, Alexey
    Axyonov, Alexandr
    Kitenko, Andrey
    Lashkov, Igor
    Karpov, Alexey
    [J]. INTERSPEECH 2022, 2022, : 1141 - 1142