MULTI-SCALE HYBRID FUSION NETWORK FOR MANDARIN AUDIO-VISUAL SPEECH RECOGNITION

Cited by: 0
Authors
Wang, Jinxin [1 ]
Guo, Zhongwen [1 ]
Yang, Chao [2 ]
Li, Xiaomei [1 ]
Cui, Ziyuan [1 ]
Affiliations
[1] Ocean Univ China, Fac Informat Sci & Engn, Qingdao, Peoples R China
[2] Univ Technol Sydney, Sch Comp Sci, Sydney, Australia
Source
2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME | 2023
Keywords
Audio-visual recognition; deep learning; multi-modality feature extraction;
DOI
10.1109/ICME55011.2023.00116
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Compared to feature-level or decision-level fusion alone, hybrid fusion can improve audio-visual speech recognition accuracy. Existing works mainly focus on designing the multi-modality feature extraction, interaction, and prediction processes, neglecting useful cross-modal information and the optimal combination of the different predicted results. In this paper, we propose a multi-scale hybrid fusion network (MSHF) for Mandarin audio-visual speech recognition. Our MSHF consists of a feature extraction subnetwork, which exploits the proposed multi-scale feature extraction module (MSFE) to obtain multi-scale features, and a hybrid fusion subnetwork, which integrates the intrinsic correlations of the different modalities and optimizes the weights of the per-modality prediction results to achieve the best classification. We further design a feature recognition module (FRM) for accurate audio-visual speech recognition. We conducted experiments on the CAS-VSR-W1k dataset. The experimental results show that the proposed method outperforms the selected competitive baselines and the state of the art, indicating the superiority of our proposed modules.
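The abstract's core idea, hybrid fusion, combines feature-level fusion with a weighted, decision-level combination of per-branch predictions. The paper's actual network and learned weights are not given in this record; as an illustration only, a minimal NumPy sketch of the decision-level step (hypothetical `hybrid_fusion` helper, toy weights and logits) might look like:

```python
import numpy as np

def softmax(logits):
    # Convert raw scores to a probability distribution over classes.
    e = np.exp(logits - logits.max())
    return e / e.sum()

def hybrid_fusion(p_audio, p_visual, p_fused, weights=(0.4, 0.2, 0.4)):
    # Decision-level step of a hybrid scheme: a weighted sum of the class
    # posteriors from three branches (audio-only, visual-only, and a branch
    # fed by feature-level fused representations). Weights are illustrative;
    # in the paper they would be optimized, not hand-set.
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                   # normalise branch weights
    stacked = np.stack([p_audio, p_visual, p_fused])  # shape (3, n_classes)
    return w @ stacked                                # combined posterior

# Toy example with 4 classes.
p_a = softmax(np.array([2.0, 0.5, 0.1, 0.0]))   # audio branch prediction
p_v = softmax(np.array([0.3, 1.8, 0.2, 0.1]))   # visual branch prediction
p_f = softmax(np.array([1.5, 1.4, 0.1, 0.0]))   # fused-feature branch prediction
combined = hybrid_fusion(p_a, p_v, p_f)
pred = int(np.argmax(combined))
```

Because the combined vector is a convex combination of probability distributions, it remains a valid distribution, so the final class is simply its argmax.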
Pages: 642–647 (6 pages)
Related Papers (50 in total)
  • [31] Audio-Visual Domain Adaptation Feature Fusion for Speech Emotion Recognition
    Wei, Jie
    Hu, Guanyu
    Yang, Xinyu
    Luu, Anh Tuan
    Dong, Yizhuo
    INTERSPEECH 2022, 2022, : 1988 - 1992
  • [32] Performance Improvement of Audio-Visual Speech Recognition with Optimal Reliability Fusion
    Tariquzzaman, Md
    Gyu, Song Min
    Young, Kim Jin
    You, Na Seung
    Rashid, M. A.
    2010 THE 3RD INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE AND INDUSTRIAL APPLICATION (PACIIA2010), VOL III, 2010, : 216 - 219
  • [33] Multimodal Attentive Fusion Network for audio-visual event recognition
    Brousmiche, Mathilde
    Rouat, Jean
    Dupont, Stephane
    INFORMATION FUSION, 2022, 85 : 52 - 59
  • [34] Decision Level Fusion for Audio-Visual Speech Recognition in Noisy Conditions
    Sad, Gonzalo D.
    Terissi, Lucas D.
    Gomez, Juan C.
    PROGRESS IN PATTERN RECOGNITION, IMAGE ANALYSIS, COMPUTER VISION, AND APPLICATIONS, CIARP 2016, 2017, 10125 : 360 - 367
  • [35] Audio-Visual Action Recognition Using Transformer Fusion Network
    Kim, Jun-Hwa
    Won, Chee Sun
    APPLIED SCIENCES-BASEL, 2024, 14 (03):
  • [36] CATNet: Cross-modal fusion for audio-visual speech recognition
    Wang, Xingmei
    Mi, Jiachen
    Li, Boquan
    Zhao, Yixu
    Meng, Jiaxiang
    PATTERN RECOGNITION LETTERS, 2024, 178 : 216 - 222
  • [37] An audio-visual speech recognition system for testing new audio-visual databases
    Pao, Tsang-Long
    Liao, Wen-Yuan
    VISAPP 2006: PROCEEDINGS OF THE FIRST INTERNATIONAL CONFERENCE ON COMPUTER VISION THEORY AND APPLICATIONS, VOL 2, 2006, : 192 - +
  • [38] LEARNING CONTEXTUALLY FUSED AUDIO-VISUAL REPRESENTATIONS FOR AUDIO-VISUAL SPEECH RECOGNITION
    Zhang, Zi-Qiang
    Zhang, Jie
    Zhang, Jian-Shu
    Wu, Ming-Hui
    Fang, Xin
    Dai, Li-Rong
    2022 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2022, : 1346 - 1350
  • [39] Multi-Modal and Multi-Scale Temporal Fusion Architecture Search for Audio-Visual Video Parsing
    Zhang, Jiayi
    Li, Weixin
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 3328 - 3336
  • [40] Multi-scale network with shared cross-attention for audio-visual correlation learning
    Zhang, Jiwei
    Yu, Yi
    Tang, Suhua
    Li, Wei
    Wu, Jianming
    NEURAL COMPUTING & APPLICATIONS, 2023, 35 (27): : 20173 - 20187