Multi-modal Emotion Recognition Network with Balanced Audio-visual Feature Extraction

Cited: 0
Authors
Chen, Zeyu [1 ]
Wu, Yiming [1 ]
Cao, Ronghui [2 ]
Affiliations
[1] Hunan Univ, Coll Comp Sci & Elect Engn, Changsha, Peoples R China
[2] Changsha Univ Sci & Technol, Sch Comp & Commun Engn, Changsha, Peoples R China
Source
2024 5TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND COMPUTER ENGINEERING, ICAICE | 2024
Funding
National Natural Science Foundation of China;
Keywords
deep learning; multi-modal; emotion recognition; Transformer;
DOI
10.1109/ICAICE63571.2024.10864010
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video and audio are channels through which humans perceive the world beyond language, and enabling robots to recognize and imitate human emotional expression is a natural goal. However, most current audio-visual emotion analysis models extract deep features from only one modality while the other plays merely a supporting role, so deep features are not fully extracted from both modalities. This article proposes a balanced audio-visual emotion analysis model. Specifically, EfficientNet and Wav2vec 2.0 are used for visual and auditory feature extraction, respectively, ensuring that deep features are extracted from both modalities. Secondly, a Transformer is used as the decision-level fusion operator, exchanging information between the two modalities. We verified our model on the RAVDESS dataset and achieved a Top-1 accuracy of 88.54%, surpassing auxiliary-type audio-visual emotion analysis models.
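The fusion scheme the abstract describes, per-modality backbone features exchanged through attention before classification, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the random weights, frame counts, and single-head cross-attention stand in for the trained EfficientNet/Wav2vec 2.0 backbones and the full Transformer fusion block; the feature dimensions (1280, 768) and the 8 RAVDESS emotion classes are assumptions taken from the standard configurations of those components.

```python
import numpy as np

rng = np.random.default_rng(0)
# Assumed dims: EfficientNet-B0 (1280) and Wav2vec 2.0 base (768) features,
# a small shared model dim, and RAVDESS's 8 emotion classes.
D_VIS, D_AUD, D_MODEL, N_CLASSES = 1280, 768, 128, 8

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, context):
    """Single-head cross-attention: `query` tokens gather information
    from `context` tokens (random weights stand in for learned ones)."""
    Wq = rng.standard_normal((query.shape[-1], D_MODEL)) * 0.02
    Wk = rng.standard_normal((context.shape[-1], D_MODEL)) * 0.02
    Wv = rng.standard_normal((context.shape[-1], D_MODEL)) * 0.02
    Q, K, V = query @ Wq, context @ Wk, context @ Wv
    return softmax(Q @ K.T / np.sqrt(D_MODEL)) @ V

# Random stand-ins for per-frame backbone outputs.
visual = rng.standard_normal((16, D_VIS))  # 16 video frames
audio = rng.standard_normal((50, D_AUD))   # 50 audio frames

# Exchange information across modalities, then pool and classify.
vis_att = cross_attention(visual, audio)   # visual attends to audio
aud_att = cross_attention(audio, visual)   # audio attends to visual
fused = np.concatenate([vis_att.mean(axis=0), aud_att.mean(axis=0)])
W_cls = rng.standard_normal((fused.shape[0], N_CLASSES)) * 0.02
logits = fused @ W_cls
print(logits.shape)  # -> (8,)
```

The key point the sketch captures is symmetry: both modalities pass through an equally deep feature extractor and both act as query and context during fusion, rather than one modality merely re-weighting the other.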
Pages: 675-679
Page count: 5