Improving multimodal fusion with Main Modal Transformer for emotion recognition in conversation

Cited by: 50
Authors
Zou, ShiHao [1]
Huang, Xianying [1]
Shen, XuDong [1]
Liu, Hankai [1]
Affiliations
[1] Chongqing University of Technology, College of Computer Science & Engineering, Chongqing 400054, People's Republic of China
Keywords
Emotion recognition in conversation; Main modal; Transformer; Multihead attention; Emotional cues; Language
DOI
10.1016/j.knosys.2022.109978
Chinese Library Classification (CLC)
TP18 [Theory of artificial intelligence];
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Emotion recognition in conversation (ERC) is essential for developing empathetic conversational systems. In conversation, emotions can be expressed in multiple modalities, i.e., audio, text, and visual. Owing to the inherent characteristics of each modality, it is difficult for a model to use all modalities effectively when fusing modal information. However, existing approaches assume that every modality has the same representation ability, which results in unsatisfactory fusion across modalities. We therefore treat different modalities as having different representation abilities, propose the concept of the main modal, i.e., the modality with the strongest representation ability after feature extraction, and present the Main Modal Transformer (MMTr) to improve multimodal fusion. The method preserves the integrity of the main modal's features and enhances the representations of the weak modalities by using multihead attention to learn the information interactions between modalities. In addition, we design a new emotional cue extractor that extracts emotional cues at two levels (the speaker's self-context and the surrounding conversational context) to enrich the conversational information available to each modality. Extensive experiments on two benchmark datasets validate the effectiveness and superiority of our model.
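A minimal sketch of the fusion idea the abstract describes, assuming PyTorch: the weak modality queries the main modal through multihead attention so its representation is enriched, while the main modal's features pass through unchanged. The class name, tensor shapes, and the residual/normalization details are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MainModalEnhancer(nn.Module):
    """Hypothetical sketch of main-modal-guided fusion: the weak modality
    attends to the main modal, and only the weak representation is updated."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, weak: torch.Tensor, main: torch.Tensor):
        # weak, main: (batch, num_utterances, dim) utterance-level features.
        # Queries come from the weak modality; keys/values from the main modal,
        # so cross-modal information flows into the weak representation only.
        enhanced, _ = self.attn(query=weak, key=main, value=main)
        weak = self.norm(weak + enhanced)  # residual connection + layer norm
        return main, weak                  # main modal features kept intact

# Illustrative usage: treat text as the main modal and enhance audio with it.
text = torch.randn(4, 20, 256)
audio = torch.randn(4, 20, 256)
main_out, audio_out = MainModalEnhancer()(weak=audio, main=text)
```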
Pages: 9
Related papers (39 in total)
[21] Liu, Yunze; Fan, Qingnan; Zhang, Shanghang; Dong, Hao; Funkhouser, Thomas; Yi, Li. Contrastive Multimodal Fusion with TupleInfoNCE. 2021 IEEE/CVF International Conference on Computer Vision (ICCV 2021), 2021: 734-743.
[22] Liu Z. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), Vol. 1, 2018: 2247.
[23] Ma, Hui; Wang, Jian; Lin, Hongfei; Pan, Xuejun; Zhang, Yijia; Yang, Zhihao. A multi-view network for real-time emotion recognition in conversations. Knowledge-Based Systems, 2022, 236.
[24] Ma, Hui; Wang, Jian; Qian, Lingfei; Lin, Hongfei. HAN-ReGRU: hierarchical attention network with residual gated recurrent unit for emotion recognition in conversation. Neural Computing & Applications, 2021, 33(7): 2685-2703.
[25] Ma, Yukun; Nguyen, Khanh Linh; Xing, Frank Z.; Cambria, Erik. A survey on empathetic dialogue systems. Information Fusion, 2020, 64: 50-70.
[26] Majumder N. AAAI Conference on Artificial Intelligence, 2019: 6818.
[27] Ni JJ. AAAI Conference on Artificial Intelligence, 2022: 11112.
[28] Nowak, MA; Krakauer, DC. The evolution of language. Proceedings of the National Academy of Sciences of the United States of America, 1999, 96(14): 8028-8033.
[29] Pagel, Mark. Human language as a culturally transmitted replicator. Nature Reviews Genetics, 2009, 10(6): 405-415.
[30] Poria S. 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019), 2019: 527.