Improving multimodal fusion with Main Modal Transformer for emotion recognition in conversation

Cited by: 35
Authors
Zou, ShiHao [1 ]
Huang, Xianying [1 ]
Shen, XuDong [1 ]
Liu, Hankai [1 ]
Affiliations
[1] Chongqing University of Technology, College of Computer Science and Engineering, Chongqing 400054, People's Republic of China
Keywords
Emotion recognition in conversation; Main modal; Transformer; Multihead attention; Emotional cues; Language
DOI
10.1016/j.knosys.2022.109978
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Emotion recognition in conversation (ERC) is essential for developing empathic conversation systems. In conversation, emotions can be expressed through multiple modalities, i.e., audio, text, and visual. Because of the inherent characteristics of each modality, it is difficult for a model to use all modalities effectively when fusing modal information. Moreover, existing approaches assume that every modality has the same representation ability, which leads to unsatisfactory fusion across modalities. We therefore treat modalities as having different representation abilities, propose the concept of the main modal, i.e., the modality with the stronger representation ability after feature extraction, and then propose the Main Modal Transformer (MMTr) to improve multimodal fusion. The method preserves the integrity of the main modal features and enhances the representations of the weak modalities by using multihead attention to learn the information interactions between modalities. In addition, we design a new emotional cue extractor that extracts emotional cues at two levels (the speaker's own context and the surrounding conversational context) to enrich the conversational information available to each modality. Extensive experiments on two benchmark datasets validate the effectiveness and superiority of our model. © 2022 Elsevier B.V. All rights reserved.
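The fusion scheme summarized in the abstract lends itself to a short illustration. The sketch below is a minimal, hypothetical rendering of the idea of keeping the main-modal features intact while enhancing a weaker modality through cross-modal multihead attention; it is not the authors' implementation, and all names and choices (MainModalFusion, d_model, treating text as the main modality, the residual-plus-layer-norm step, the query/key assignment) are assumptions made for the example. It is written in PyTorch.

# Minimal, illustrative sketch of main-modal fusion, NOT the paper's code.
# Assumptions: text is the main modality, audio is a weak modality, and the
# weak modality is enhanced by attending to the main one.
import torch
import torch.nn as nn


class MainModalFusion(nn.Module):
    """Enhance a weak modality with cross-modal multihead attention while
    leaving the main-modality features untouched."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, main_feats: torch.Tensor, weak_feats: torch.Tensor):
        # Queries come from the weak modality, keys/values from the main one,
        # so cross-modal information flows into the weaker representation.
        enhanced, _ = self.cross_attn(weak_feats, main_feats, main_feats)
        weak_enhanced = self.norm(weak_feats + enhanced)  # residual connection
        # The main modality passes through unchanged ("integrity preserved");
        # downstream layers could concatenate or further fuse the two streams.
        return main_feats, weak_enhanced


if __name__ == "__main__":
    B, T, D = 2, 10, 256            # batch size, utterances per dialogue, feature dim
    text = torch.randn(B, T, D)     # assumed main modality
    audio = torch.randn(B, T, D)    # assumed weak modality
    fused_main, fused_audio = MainModalFusion(D, 4)(text, audio)
    print(fused_main.shape, fused_audio.shape)  # both torch.Size([2, 10, 256])

In this reading, the weak modality supplies the queries so that attention pulls main-modal information into it; the abstract does not specify the actual query/key assignment or how the enhanced streams are combined downstream, so those details should be taken from the paper itself.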
Pages: 9