Multimodal Sparse Transformer Network for Audio-Visual Speech Recognition

Cited by: 51
Authors
Song, Qiya [1,2]
Sun, Bin [1,2]
Li, Shutao [1,2]
Affiliations
[1] Hunan Univ, Coll Elect & Informat Engn, Changsha 410082, Hunan, Peoples R China
[2] Hunan Univ, Key Lab Visual Percept & Artificial Intelligence, Changsha 410082, Hunan, Peoples R China
Keywords
Visualization; Speech recognition; Transformers; Lips; Feature extraction; Noise measurement; Task analysis; Audio-visual speech recognition (AVSR); cross-modal attention; motion information; multimodal; sparse transformer;
DOI
10.1109/TNNLS.2022.3163771
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Automatic speech recognition (ASR) is the main human-machine interface in many intelligent systems, such as smart homes, autonomous driving, and service robots. However, its performance usually deteriorates significantly in the presence of external noise, which limits its application scenarios. Audio-visual speech recognition (AVSR) uses visual information as a complementary modality to effectively enhance audio speech recognition, particularly in noisy conditions. Recently, transformer-based architectures have been used to model the audio and video sequences for AVSR and achieve superior performance. However, these architectures may attend to irrelevant information when modeling long-term dependencies, which degrades performance. In addition, motion features are essential for capturing the spatio-temporal information within the lip region and thus for fully exploiting the visual sequences, but they have not been considered in AVSR tasks. Therefore, this article proposes a multimodal sparse transformer network (MMST). Its sparse self-attention mechanism concentrates attention on global information by selecting only the most relevant parts. Moreover, motion features are seamlessly introduced into the MMST model: motion-modality information flows into the visual modality through a cross-modal attention module to enhance the visual features, further improving recognition performance. Extensive experiments on different datasets validate that the proposed method outperforms several state-of-the-art methods in terms of word error rate (WER).
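
The abstract describes two attention mechanisms: a sparse self-attention that keeps only the most relevant positions, and a cross-modal attention that lets motion features enhance visual features. Below is a minimal PyTorch sketch of these two ideas, assuming a simple top-k sparsification rule and single-head attention; all class names, the parameter k, and the dimensions are illustrative assumptions and are not taken from the MMST paper.

# Illustrative sketch only: (1) sparse self-attention that keeps the top-k
# most relevant keys per query, (2) cross-modal attention in which visual
# features (queries) attend to motion features (keys/values).
# Names, dimensions, and the top-k rule are assumptions, not the paper's design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseSelfAttention(nn.Module):
    """Single-head self-attention that masks all but the top-k scores per query."""

    def __init__(self, dim: int, k: int = 8):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.scale = dim ** -0.5
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        q, k_, v = self.qkv(x).chunk(3, dim=-1)
        scores = torch.matmul(q, k_.transpose(-2, -1)) * self.scale  # (B, T, T)
        top_k = min(self.k, scores.size(-1))
        # Keep the top-k scores in each query row; set the rest to -inf before softmax.
        thresh = scores.topk(top_k, dim=-1).values[..., -1:]  # k-th largest score
        scores = scores.masked_fill(scores < thresh, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        return torch.matmul(attn, v)


class CrossModalAttention(nn.Module):
    """Visual features attend to motion features; output enhances the visual stream."""

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.scale = dim ** -0.5

    def forward(self, visual: torch.Tensor, motion: torch.Tensor) -> torch.Tensor:
        q = self.q(visual)
        k_, v = self.kv(motion).chunk(2, dim=-1)
        attn = F.softmax(torch.matmul(q, k_.transpose(-2, -1)) * self.scale, dim=-1)
        # Residual connection keeps the original visual stream intact.
        return visual + torch.matmul(attn, v)


if __name__ == "__main__":
    B, T, D = 2, 75, 256  # e.g. 75 video frames with 256-d features (illustrative)
    visual = torch.randn(B, T, D)
    motion = torch.randn(B, T, D)
    enhanced = CrossModalAttention(D)(visual, motion)
    fused = SparseSelfAttention(D, k=16)(enhanced)
    print(fused.shape)  # torch.Size([2, 75, 256])

In this sketch, scores below the k-th largest value in each query row are masked out before the softmax, which is one common way to realize sparse attention; the paper's actual selection strategy and multimodal fusion may differ.
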
Pages: 10028 - 10038
Number of pages: 11
Related Papers
50 records in total
  • [1] DEEP MULTIMODAL LEARNING FOR AUDIO-VISUAL SPEECH RECOGNITION
    Mroueh, Youssef
    Marcheret, Etienne
    Goel, Vaibhava
    2015 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP), 2015, : 2130 - 2134
  • [2] An audio-visual corpus for multimodal automatic speech recognition
    Czyzewski, Andrzej
    Kostek, Bozena
    Bratoszewski, Piotr
    Kotus, Jozef
    Szykulski, Marcin
    JOURNAL OF INTELLIGENT INFORMATION SYSTEMS, 2017, 49 (02) : 167 - 192
  • [3] Audio-Visual (Multimodal) Speech Recognition System Using Deep Neural Network
    Paulin, Hebsibah
    Milton, R. S.
    JanakiRaman, S.
    Chandraprabha, K.
    JOURNAL OF TESTING AND EVALUATION, 2019, 47 (06) : 3963 - 3974
  • [4] Indonesian Audio-Visual Speech Corpus for Multimodal Automatic Speech Recognition
    Maulana, Muhammad Rizki Aulia Rahman
    Fanany, Mohamad Ivan
    2017 INTERNATIONAL CONFERENCE ON ADVANCED COMPUTER SCIENCE AND INFORMATION SYSTEMS (ICACSIS), 2017, : 381 - 385
  • [5] Multimodal Learning Using 3D Audio-Visual Data for Audio-Visual Speech Recognition
    Su, Rongfeng
    Wang, Lan
    Liu, Xunying
    2017 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP), 2017, : 40 - 43
  • [6] AFT-SAM: Adaptive Fusion Transformer with a Sparse Attention Mechanism for Audio-Visual Speech Recognition
    Che, Na
    Zhu, Yiming
    Wang, Haiyan
    Zeng, Xianwei
    Du, Qinsheng
    APPLIED SCIENCES-BASEL, 2025, 15 (01)
  • [7] Audio-visual speech recognition based on joint training with audio-visual speech enhancement for robust speech recognition
    Hwang, Jung-Wook
    Park, Jeongkyun
    Park, Rae-Hong
    Park, Hyung-Min
    APPLIED ACOUSTICS, 2023, 211
  • [8] Audio-Visual Action Recognition Using Transformer Fusion Network
    Kim, Jun-Hwa
    Won, Chee Sun
    APPLIED SCIENCES-BASEL, 2024, 14 (03)
  • [9] Multimodal Attentive Fusion Network for audio-visual event recognition
    Brousmiche, Mathilde
    Rouat, Jean
    Dupont, Stephane
    INFORMATION FUSION, 2022, 85 : 52 - 59