Multimodal Sparse Transformer Network for Audio-Visual Speech Recognition

Cited by: 51
Authors
Song, Qiya [1,2]
Sun, Bin [1,2]
Li, Shutao [1,2]
Affiliations
[1] Hunan Univ, Coll Elect & Informat Engn, Changsha 410082, Hunan, Peoples R China
[2] Hunan Univ, Key Lab Visual Percept & Artificial Intelligence, Changsha 410082, Hunan, Peoples R China
Keywords
Visualization; Speech recognition; Transformers; Lips; Feature extraction; Noise measurement; Task analysis; Audio-visual speech recognition (AVSR); cross-modal attention; motion information; multimodal; sparse transformer;
DOI
10.1109/TNNLS.2022.3163771
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Automatic speech recognition (ASR) is the main human-machine interface in many intelligent systems, such as smart homes, autonomous driving, and service robots. However, its performance usually deteriorates significantly in the presence of external noise, which limits its application scenarios. Audio-visual speech recognition (AVSR) uses visual information as a complementary modality to effectively enhance audio speech recognition, particularly in noisy conditions. Recently, transformer-based architectures have been used to model the audio and video sequences for AVSR and achieve superior performance. However, these architectures may attend to irrelevant information when modeling long-term dependencies, which degrades performance. In addition, motion features are essential for capturing the spatio-temporal information within the lip region and thus for fully exploiting the visual sequences, but they have not been considered in AVSR tasks. Therefore, this article proposes a multimodal sparse transformer network (MMST). Its sparse self-attention mechanism concentrates attention on global information by selecting only the most relevant parts. Moreover, motion features are seamlessly introduced into the MMST model: motion-modality information flows into the visual modality through a cross-modal attention module to enhance the visual features, further improving recognition performance. Extensive experiments on different datasets validate that the proposed method outperforms several state-of-the-art methods in terms of word error rate (WER).
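
The abstract describes two attention mechanisms: a sparse self-attention that keeps only the most relevant positions, and a cross-modal attention that lets motion features enhance visual features. Below is a minimal PyTorch sketch of these two ideas, assuming a simple top-k sparsification rule and single-head attention; all class names, the parameter k, and the dimensions are illustrative assumptions and are not taken from the MMST paper.

# Illustrative sketch only: (1) sparse self-attention that keeps the top-k
# most relevant keys per query, (2) cross-modal attention in which visual
# features (queries) attend to motion features (keys/values).
# Names, dimensions, and the top-k rule are assumptions, not the paper's design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseSelfAttention(nn.Module):
    """Single-head self-attention that masks all but the top-k scores per query."""

    def __init__(self, dim: int, k: int = 8):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.scale = dim ** -0.5
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        q, k_, v = self.qkv(x).chunk(3, dim=-1)
        scores = torch.matmul(q, k_.transpose(-2, -1)) * self.scale  # (B, T, T)
        top_k = min(self.k, scores.size(-1))
        # Keep the top-k scores in each query row; set the rest to -inf before softmax.
        thresh = scores.topk(top_k, dim=-1).values[..., -1:]  # k-th largest score
        scores = scores.masked_fill(scores < thresh, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        return torch.matmul(attn, v)


class CrossModalAttention(nn.Module):
    """Visual features attend to motion features; output enhances the visual stream."""

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.scale = dim ** -0.5

    def forward(self, visual: torch.Tensor, motion: torch.Tensor) -> torch.Tensor:
        q = self.q(visual)
        k_, v = self.kv(motion).chunk(2, dim=-1)
        attn = F.softmax(torch.matmul(q, k_.transpose(-2, -1)) * self.scale, dim=-1)
        # Residual connection keeps the original visual stream intact.
        return visual + torch.matmul(attn, v)


if __name__ == "__main__":
    B, T, D = 2, 75, 256  # e.g. 75 video frames with 256-d features (illustrative)
    visual = torch.randn(B, T, D)
    motion = torch.randn(B, T, D)
    enhanced = CrossModalAttention(D)(visual, motion)
    fused = SparseSelfAttention(D, k=16)(enhanced)
    print(fused.shape)  # torch.Size([2, 75, 256])

In this sketch, scores below the k-th largest value in each query row are masked out before the softmax, which is one common way to realize sparse attention; the paper's actual selection strategy and multimodal fusion may differ.
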
Pages: 10028 - 10038
Number of pages: 11
Related Papers
50 records in total
  • [1] DEEP MULTIMODAL LEARNING FOR AUDIO-VISUAL SPEECH RECOGNITION
    Mroueh, Youssef
    Marcheret, Etienne
    Goel, Vaibhava
    2015 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP), 2015, : 2130 - 2134
  • [2] An audio-visual corpus for multimodal automatic speech recognition
    Czyzewski, Andrzej
    Kostek, Bozena
    Bratoszewski, Piotr
    Kotus, Jozef
    Szykulski, Marcin
    JOURNAL OF INTELLIGENT INFORMATION SYSTEMS, 2017, 49 (02) : 167 - 192
  • [3] Audio-Visual (Multimodal) Speech Recognition System Using Deep Neural Network
    Paulin, Hebsibah
    Milton, R. S.
    JanakiRaman, S.
    Chandraprabha, K.
    JOURNAL OF TESTING AND EVALUATION, 2019, 47 (06) : 3963 - 3974
  • [4] Indonesian Audio-Visual Speech Corpus for Multimodal Automatic Speech Recognition
    Maulana, Muhammad Rizki Aulia Rahman
    Fanany, Mohamad Ivan
    2017 INTERNATIONAL CONFERENCE ON ADVANCED COMPUTER SCIENCE AND INFORMATION SYSTEMS (ICACSIS), 2017, : 381 - 385
  • [5] Multimodal Learning Using 3D Audio-Visual Data for Audio-Visual Speech Recognition
    Su, Rongfeng
    Wang, Lan
    Liu, Xunying
    2017 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP), 2017, : 40 - 43
  • [6] AFT-SAM: Adaptive Fusion Transformer with a Sparse Attention Mechanism for Audio-Visual Speech Recognition
    Che, Na
    Zhu, Yiming
    Wang, Haiyan
    Zeng, Xianwei
    Du, Qinsheng
    APPLIED SCIENCES-BASEL, 2025, 15 (01)
  • [7] Audio-visual speech recognition based on joint training with audio-visual speech enhancement for robust speech recognition
    Hwang, Jung-Wook
    Park, Jeongkyun
    Park, Rae-Hong
    Park, Hyung-Min
    APPLIED ACOUSTICS, 2023, 211
  • [8] Audio-Visual Action Recognition Using Transformer Fusion Network
    Kim, Jun-Hwa
    Won, Chee Sun
    APPLIED SCIENCES-BASEL, 2024, 14 (03)
  • [9] Multimodal Attentive Fusion Network for audio-visual event recognition
    Brousmiche, Mathilde
    Rouat, Jean
    Dupont, Stephane
    INFORMATION FUSION, 2022, 85 : 52 - 59