Audio-Visual Fusion Network Based on Conformer for Multimodal Emotion Recognition

Cited by: 3
Authors
Guo, Peini [1 ,2 ]
Chen, Zhengyan [1 ]
Li, Yidi [1 ]
Liu, Hong [1 ]
Affiliations
[1] Peking Univ, Shenzhen Grad Sch, Key Lab Machine Percept, Shenzhen, Peoples R China
[2] Shanghai Univ, Shanghai, Peoples R China
Source
ARTIFICIAL INTELLIGENCE, CICAI 2022, PT II | 2022, Vol. 13605
Keywords
Multimodal emotion recognition; Audio-visual fusion; Convolutional Neural Network; Transformer;
D O I
10.1007/978-3-031-20500-2_26
CLC Classification
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Audio-visual emotion recognition aims to integrate audio and visual information for accurate emotion prediction, and is widely used in real-world applications. However, most existing methods fail to fully exploit the complementary information across modalities to obtain rich, emotion-related feature representations. Recently, Transformer- and CNN-based models have achieved remarkable results in automatic speech recognition. Motivated by this, we propose a novel audio-visual fusion network based on 3D-CNN and the Convolution-augmented Transformer (Conformer) for multimodal emotion recognition. First, a 3D-CNN is employed to process face sequences extracted from the video, and a 1D-CNN is used to process MFCC features of the audio signal. Second, the visual and audio features are fed into a feature fusion module, which combines a set of convolutional layers for extracting local features with a self-attention mechanism for capturing global interactions of multimodal information. Finally, the fused features are passed through linear layers to obtain the prediction results. To verify the effectiveness of the proposed method, experiments are performed on RAVDESS and a newly collected dataset named PKU-ER. The experimental results show that the proposed model achieves state-of-the-art performance in audio-only, video-only, and audio-visual fusion experiments.
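The fusion module described in the abstract (convolution for local features plus self-attention for global interactions over concatenated modality features) can be sketched as follows. This is a minimal, hypothetical NumPy rendition under stated assumptions: the feature dimensions, the single-head identity-projection attention, the moving-average "depthwise convolution", and all function names are illustrative choices, not the authors' actual Conformer implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # Single-head scaled dot-product self-attention over the sequence.
    # x: (T, d). Identity Q/K/V projections are an illustrative simplification.
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)          # (T, T): global pairwise interactions
    return softmax(scores, axis=-1) @ x    # (T, d)

def depthwise_conv1d(x, k=3):
    # Per-channel moving-average convolution capturing local context.
    # x: (T, d); zero padding preserves the sequence length.
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([xp[t:t + k].mean(axis=0) for t in range(x.shape[0])])

def fuse_and_classify(audio_feats, visual_feats, w_out):
    # Concatenate modality features along the time axis, apply the
    # conv + attention fusion (with residuals), mean-pool, and project
    # to emotion logits via a linear layer.
    x = np.concatenate([audio_feats, visual_feats], axis=0)  # (Ta+Tv, d)
    x = x + depthwise_conv1d(x)        # local features (residual)
    x = x + self_attention(x)          # global interactions (residual)
    pooled = x.mean(axis=0)            # (d,)
    return pooled @ w_out              # (num_emotions,) logits

rng = np.random.default_rng(0)
audio = rng.normal(size=(20, 64))      # stand-in for 1D-CNN features of MFCCs
visual = rng.normal(size=(16, 64))     # stand-in for 3D-CNN face-sequence features
w = rng.normal(size=(64, 8))           # 8 emotion classes, as in RAVDESS
logits = fuse_and_classify(audio, visual, w)
print(logits.shape)                    # (8,)
```

In the paper's actual model the convolution and attention are organized as Conformer blocks with learned projections and feed-forward layers; this sketch only conveys the local-plus-global fusion idea.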
Pages: 315-326
Page count: 12
Related Papers
50 records in total
  • [1] Audio-Visual Learning for Multimodal Emotion Recognition
    Fan, Siyu
    Jing, Jianan
    Wang, Chongwen
    SYMMETRY-BASEL, 2025, 17 (03):
  • [2] Multimodal Attentive Fusion Network for audio-visual event recognition
    Brousmiche, Mathilde
    Rouat, Jean
    Dupont, Stephane
    INFORMATION FUSION, 2022, 85 : 52 - 59
  • [3] Fish behavior recognition based on an audio-visual multimodal interactive fusion network
    Yang, Yuxin
    Yu, Hong
    Zhang, Xin
    Zhang, Peng
    Tu, Wan
    Gu, Lishuai
    AQUACULTURAL ENGINEERING, 2024, 107
  • [4] Multimodal Deep Convolutional Neural Network for Audio-Visual Emotion Recognition
    Zhang, Shiqing
    Zhang, Shiliang
    Huang, Tiejun
    Gao, Wen
    ICMR'16: PROCEEDINGS OF THE 2016 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, 2016, : 281 - 284
  • [5] Metric Learning-Based Multimodal Audio-Visual Emotion Recognition
    Ghaleb, Esam
    Popa, Mirela
    Asteriadis, Stylianos
    IEEE MULTIMEDIA, 2020, 27 (01) : 37 - 48
  • [6] Fusion of Classifier Predictions for Audio-Visual Emotion Recognition
    Noroozi, Fatemeh
    Marjanovic, Marina
    Njegus, Angelina
    Escalera, Sergio
    Anbarjafari, Gholamreza
    2016 23RD INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2016, : 61 - 66
  • [7] Multimodal and Temporal Perception of Audio-visual Cues for Emotion Recognition
    Ghaleb, Esam
    Popa, Mirela
    Asteriadis, Stylianos
    2019 8TH INTERNATIONAL CONFERENCE ON AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION (ACII), 2019,
  • [8] Multimodal Emotion Recognition using Physiological and Audio-Visual Features
    Matsuda, Yuki
    Fedotov, Dmitrii
    Takahashi, Yuta
    Arakawa, Yutaka
    Yasumo, Keiichi
    Minker, Wolfgang
    PROCEEDINGS OF THE 2018 ACM INTERNATIONAL JOINT CONFERENCE ON PERVASIVE AND UBIQUITOUS COMPUTING AND PROCEEDINGS OF THE 2018 ACM INTERNATIONAL SYMPOSIUM ON WEARABLE COMPUTERS (UBICOMP/ISWC'18 ADJUNCT), 2018, : 946 - 951
  • [9] Deep learning based multimodal emotion recognition using model-level fusion of audio-visual modalities
    Middya, Asif Iqbal
    Nag, Baibhav
    Roy, Sarbani
    KNOWLEDGE-BASED SYSTEMS, 2022, 244