A multimodal fusion emotion recognition method based on multitask learning and attention mechanism

Times Cited: 10
Authors
Xie, Jinbao [1 ]
Wang, Jiyu [2 ]
Wang, Qingyan [2 ]
Yang, Dali [1 ]
Gu, Jinming [2 ]
Tang, Yongqiang [2 ]
Varatnitski, Yury I. [3 ]
Affiliations
[1] Hainan Normal Univ, Coll Phys & Elect Engn, Haikou 571158, Peoples R China
[2] Harbin Univ Sci & Technol, Sch Measurement & Control Technol & Commun Engn, Harbin 150000, Peoples R China
[3] Belarusian State Univ, Fac Radiophys & Comp Technol, Minsk 220030, Belarus
Keywords
Multitask learning; Attention mechanism; Multimodal; Emotion recognition; Sentiment analysis
DOI
10.1016/j.neucom.2023.126649
Chinese Library Classification (CLC) Code
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
With new developments in human-computer interaction, researchers are paying increasing attention to emotion recognition, and to multimodal emotion recognition in particular, because emotion is expressed across multiple dimensions. In this study, we propose a multimodal fusion emotion recognition method (MTL-BAM) based on multitask learning and an attention mechanism. It addresses two major shortcomings of existing multimodal emotion recognition methods: they overlook the emotional interactions among modalities, and they focus on the emotional similarities among modalities while ignoring their differences. An improved attention mechanism analyzes the emotional contribution of each modality so that the modality-specific emotional representations can learn from and complement one another, yielding better interactive fusion and forming the basis of a multitask learning framework. Three monomodal emotion recognition tasks are introduced as auxiliary tasks, enabling the model to detect emotional differences among modalities. A label generation unit within the auxiliary tasks derives monomodal emotion label values more accurately through two proportional formulas while avoiding the zero-value problem. Our results show that the proposed method outperforms selected state-of-the-art methods on four evaluation indexes of emotion classification (accuracy, F1 score, MAE, and Pearson correlation coefficient). It achieves accuracy rates of 85.36% and 84.61% on the public multimodal datasets CMU-MOSI and CMU-MOSEI, respectively, which are 2-6% higher than those of existing state-of-the-art models, demonstrating strong multimodal emotion recognition performance and good generalizability.
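The abstract describes cross-modal attention fusion combined with auxiliary monomodal prediction tasks. The sketch below is a minimal, hypothetical PyTorch illustration of that general pattern (shared-space projection, attention across modality representations, one unimodal head per modality, and a weighted multitask loss); the feature dimensions, layer sizes, and auxiliary loss weight are assumptions, and the paper's label generation unit with its two proportional formulas is not reproduced here, as the record does not specify them.

```python
# Minimal sketch of a multitask attention-fusion model in the spirit of MTL-BAM.
# All dimensions and hyperparameters are illustrative assumptions, not the
# authors' published configuration.
import torch
import torch.nn as nn


class MultitaskAttentionFusion(nn.Module):
    def __init__(self, text_dim=768, audio_dim=74, video_dim=35, hidden=128, num_heads=4):
        super().__init__()
        # Project each modality's utterance-level features into a shared hidden space.
        self.proj = nn.ModuleDict({
            "text": nn.Linear(text_dim, hidden),
            "audio": nn.Linear(audio_dim, hidden),
            "video": nn.Linear(video_dim, hidden),
        })
        # Multi-head attention over the three modality tokens: each modality
        # representation attends to, and absorbs cues from, the others.
        self.attn = nn.MultiheadAttention(hidden, num_heads, batch_first=True)
        # Auxiliary unimodal heads (one per modality) plus the main fused head.
        self.uni_heads = nn.ModuleDict({m: nn.Linear(hidden, 1) for m in ("text", "audio", "video")})
        self.fused_head = nn.Sequential(nn.Linear(3 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, text, audio, video):
        # Each input: (batch, modality_dim) utterance-level features.
        feats = {m: self.proj[m](x) for m, x in (("text", text), ("audio", audio), ("video", video))}
        stacked = torch.stack([feats["text"], feats["audio"], feats["video"]], dim=1)  # (B, 3, H)
        fused, _ = self.attn(stacked, stacked, stacked)  # cross-modal interaction
        uni_preds = {m: self.uni_heads[m](fused[:, i]) for i, m in enumerate(("text", "audio", "video"))}
        main_pred = self.fused_head(fused.flatten(1))
        return main_pred, uni_preds


def multitask_loss(main_pred, uni_preds, main_label, uni_labels, aux_weight=0.3):
    # Main sentiment regression loss plus weighted auxiliary unimodal losses.
    mse = nn.functional.mse_loss
    loss = mse(main_pred.squeeze(-1), main_label)
    for m, pred in uni_preds.items():
        loss = loss + aux_weight * mse(pred.squeeze(-1), uni_labels[m])
    return loss


if __name__ == "__main__":
    model = MultitaskAttentionFusion()
    t, a, v = torch.randn(8, 768), torch.randn(8, 74), torch.randn(8, 35)
    main_pred, uni_preds = model(t, a, v)
    labels = torch.randn(8)
    # For the demo, reuse the fused labels as unimodal targets; the paper instead
    # generates monomodal labels with its label generation unit.
    loss = multitask_loss(main_pred, uni_preds, labels, {m: labels for m in uni_preds})
    loss.backward()
    print(main_pred.shape, loss.item())
```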
Pages: 13