Multi-modal Emotion Recognition Network with Balanced Audio-visual Feature Extraction

Cited: 0
Authors
Chen, Zeyu [1 ]
Wu, Yiming [1 ]
Cao, Ronghui [2 ]
Affiliations
[1] Hunan Univ, Coll Comp Sci & Elect Engn, Changsha, Peoples R China
[2] Changsha Univ Sci & Technol, Sch Comp & Commun Engn, Changsha, Peoples R China
Source
2024 5TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND COMPUTER ENGINEERING, ICAICE | 2024
Funding
National Natural Science Foundation of China;
Keywords
deep learning; multi-modal; emotion recognition; Transformer;
DOI
10.1109/ICAICE63571.2024.10864010
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video and audio are channels through which humans perceive the world beyond language, and enabling robots to recognize and imitate human emotional expression is a natural goal. However, most current audio-visual emotion analysis models extract deep features from only one modality while the other plays merely a supporting role, so deep features are not fully extracted from both modalities. This article proposes a balanced audio-visual emotion analysis model. Specifically, EfficientNet and Wav2vec 2.0 are used for visual and auditory feature extraction, respectively, ensuring that deep features are extracted from both modalities. Secondly, a Transformer is used as the decision-level fusion operator, exchanging information between the two modalities. We verified our model on the RAVDESS dataset and achieved a Top-1 accuracy of 88.54%, surpassing auxiliary-type audio-visual emotion analysis models.
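The fusion scheme the abstract describes, per-modality backbone features exchanged through attention before classification, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the random weights, frame counts, and single-head cross-attention stand in for the trained EfficientNet/Wav2vec 2.0 backbones and the full Transformer fusion block; the feature dimensions (1280, 768) and the 8 RAVDESS emotion classes are assumptions taken from the standard configurations of those components.

```python
import numpy as np

rng = np.random.default_rng(0)
# Assumed dims: EfficientNet-B0 (1280) and Wav2vec 2.0 base (768) features,
# a small shared model dim, and RAVDESS's 8 emotion classes.
D_VIS, D_AUD, D_MODEL, N_CLASSES = 1280, 768, 128, 8

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, context):
    """Single-head cross-attention: `query` tokens gather information
    from `context` tokens (random weights stand in for learned ones)."""
    Wq = rng.standard_normal((query.shape[-1], D_MODEL)) * 0.02
    Wk = rng.standard_normal((context.shape[-1], D_MODEL)) * 0.02
    Wv = rng.standard_normal((context.shape[-1], D_MODEL)) * 0.02
    Q, K, V = query @ Wq, context @ Wk, context @ Wv
    return softmax(Q @ K.T / np.sqrt(D_MODEL)) @ V

# Random stand-ins for per-frame backbone outputs.
visual = rng.standard_normal((16, D_VIS))  # 16 video frames
audio = rng.standard_normal((50, D_AUD))   # 50 audio frames

# Exchange information across modalities, then pool and classify.
vis_att = cross_attention(visual, audio)   # visual attends to audio
aud_att = cross_attention(audio, visual)   # audio attends to visual
fused = np.concatenate([vis_att.mean(axis=0), aud_att.mean(axis=0)])
W_cls = rng.standard_normal((fused.shape[0], N_CLASSES)) * 0.02
logits = fused @ W_cls
print(logits.shape)  # -> (8,)
```

The key point the sketch captures is symmetry: both modalities pass through an equally deep feature extractor and both act as query and context during fusion, rather than one modality merely re-weighting the other.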
Pages: 675-679
Page count: 5