Improving multimodal fusion with Main Modal Transformer for emotion recognition in conversation

Cited by: 35
Authors
Zou, ShiHao [1 ]
Huang, Xianying [1 ]
Shen, XuDong [1 ]
Liu, Hankai [1 ]
Affiliations
[1] Chongqing University of Technology, College of Computer Science and Engineering, Chongqing 400054, People's Republic of China
Keywords
Emotion recognition in conversation; Main modal; Transformer; Multihead attention; Emotional cues; Language
DOI
10.1016/j.knosys.2022.109978
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Emotion recognition in conversation (ERC) is essential for developing empathic conversation systems. In conversation, emotions can be expressed through multiple modalities, i.e., audio, text, and visual. Because of the inherent characteristics of each modality, it is difficult for a model to use all modalities effectively when fusing modal information. Moreover, existing approaches assume that every modality has the same representation ability, which leads to unsatisfactory fusion across modalities. We therefore treat modalities as having different representation abilities, propose the concept of the main modal, i.e., the modality with the stronger representation ability after feature extraction, and then propose the Main Modal Transformer (MMTr) to improve multimodal fusion. The method preserves the integrity of the main modal features and enhances the representations of the weak modalities by using multihead attention to learn the information interactions between modalities. In addition, we design a new emotional cue extractor that extracts emotional cues at two levels (the speaker's own context and the surrounding conversational context) to enrich the conversational information available to each modality. Extensive experiments on two benchmark datasets validate the effectiveness and superiority of our model. © 2022 Elsevier B.V. All rights reserved.
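The fusion scheme summarized in the abstract lends itself to a short illustration. The sketch below is a minimal, hypothetical rendering of the idea of keeping the main-modal features intact while enhancing a weaker modality through cross-modal multihead attention; it is not the authors' implementation, and all names and choices (MainModalFusion, d_model, treating text as the main modality, the residual-plus-layer-norm step, the query/key assignment) are assumptions made for the example. It is written in PyTorch.

# Minimal, illustrative sketch of main-modal fusion, NOT the paper's code.
# Assumptions: text is the main modality, audio is a weak modality, and the
# weak modality is enhanced by attending to the main one.
import torch
import torch.nn as nn


class MainModalFusion(nn.Module):
    """Enhance a weak modality with cross-modal multihead attention while
    leaving the main-modality features untouched."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, main_feats: torch.Tensor, weak_feats: torch.Tensor):
        # Queries come from the weak modality, keys/values from the main one,
        # so cross-modal information flows into the weaker representation.
        enhanced, _ = self.cross_attn(weak_feats, main_feats, main_feats)
        weak_enhanced = self.norm(weak_feats + enhanced)  # residual connection
        # The main modality passes through unchanged ("integrity preserved");
        # downstream layers could concatenate or further fuse the two streams.
        return main_feats, weak_enhanced


if __name__ == "__main__":
    B, T, D = 2, 10, 256            # batch size, utterances per dialogue, feature dim
    text = torch.randn(B, T, D)     # assumed main modality
    audio = torch.randn(B, T, D)    # assumed weak modality
    fused_main, fused_audio = MainModalFusion(D, 4)(text, audio)
    print(fused_main.shape, fused_audio.shape)  # both torch.Size([2, 10, 256])

In this reading, the weak modality supplies the queries so that attention pulls main-modal information into it; the abstract does not specify the actual query/key assignment or how the enhanced streams are combined downstream, so those details should be taken from the paper itself.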
Pages: 9