Multimodal Knowledge-enhanced Interactive Network with Mixed Contrastive Learning for Emotion Recognition in Conversation

Cited by: 2
Authors
Shen, Xudong [1]
Huang, Xianying [1]
Zou, Shihao [1]
Gan, Xinyi [1]
Affiliations
[1] Chongqing Univ Technol, Coll Comp Sci & Engn, Chongqing 400054, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Emotion recognition in conversation; Commonsense knowledge; Transformer; Multimodal interaction; Contrastive learning;
DOI
10.1016/j.neucom.2024.127550
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Emotion Recognition in Conversations (ERC) aims to accurately identify the emotion label of each utterance in a conversation and has significant application value in human-computer interaction. Existing research suggests that introducing commonsense knowledge (CSK) and multimodal information enhances model performance on ERC tasks. However, several challenges persist: (1) the complex psychological influences between utterances are neglected; (2) the modal information contains noise; and (3) emotion labels with few samples, as well as utterances that are semantically similar yet belong to different emotion categories, are difficult to predict. To address these problems, we propose a Multimodal Knowledge-enhanced Interactive Network with Mixed Contrastive Learning (MKIN-MCL). First, we establish a knowledge aggregation graph to capture the dependencies of CSK between utterances in a conversation and actively aggregate relevant knowledge to enhance the text features. Simultaneously, we apply feature filters to the acoustic and visual modalities to eliminate noise and improve feature quality. Furthermore, we implement an interactive attention module by stacking the designed Cross-modal Interactive Transformers (CITs), which continuously explore the relevance between the interacting modalities in their respective semantic spaces, improving the effectiveness of modality interaction while reducing the noise generated during interaction. Finally, we employ a Mixed Contrastive Learning (MCL) strategy to strengthen the model's ability to handle few-shot labels: unsupervised contrastive learning improves the representation capability of the fused multimodal features, while supervised contrastive learning extracts information from few-shot labels. Extensive experiments on two benchmark datasets, IEMOCAP and MELD, validate the effectiveness and superiority of the proposed model.
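The abstract does not spell out the exact form of the Mixed Contrastive Learning objective. As a rough, non-authoritative sketch, the Python (PyTorch) snippet below combines a SimCLR-style unsupervised NT-Xent term over two views of the fused multimodal features with a SupCon-style supervised term over emotion labels; the function names (nt_xent, sup_con, mixed_contrastive_loss), the temperature tau, and the mixing weight alpha are assumptions, not details taken from the paper.

```python
# Hypothetical sketch only: the paper's exact MCL formulation is not given in
# the abstract. It mixes an unsupervised NT-Xent term (two augmented views of
# the fused multimodal features) with a supervised SupCon-style term (emotion
# labels). Names, temperature tau, and weight alpha are assumptions.
import torch
import torch.nn.functional as F


def nt_xent(z1, z2, tau=0.1):
    """Unsupervised term: the two views of an utterance are positives,
    all other samples in the batch act as negatives."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)          # (2B, d)
    sim = z @ z.t() / tau                                        # (2B, 2B)
    b = z1.size(0)
    sim.masked_fill_(torch.eye(2 * b, dtype=torch.bool, device=z.device),
                     float('-inf'))                              # drop self-similarity
    targets = torch.cat([torch.arange(b, 2 * b),
                         torch.arange(0, b)]).to(z.device)       # index of each row's positive
    return F.cross_entropy(sim, targets)


def sup_con(z, labels, tau=0.1):
    """Supervised term: utterances sharing an emotion label are pulled
    together; rows with no same-label partner contribute zero."""
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / tau
    self_mask = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~self_mask
    sim.masked_fill_(self_mask, float('-inf'))                   # exclude self from the softmax
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    mean_log_prob_pos = (log_prob.masked_fill(~pos_mask, 0.0).sum(1)
                         / pos_mask.sum(1).clamp(min=1))
    return -mean_log_prob_pos.mean()


def mixed_contrastive_loss(view1, view2, labels, alpha=0.5):
    """Weighted mix of the two terms; alpha is an assumed hyperparameter."""
    return alpha * nt_xent(view1, view2) + (1 - alpha) * sup_con(view1, labels)
```

In a setup like this, the unsupervised term would regularize the fused representation of every utterance, while the supervised term lets few-shot emotion classes draw signal from all same-label utterances in the batch; how the paper actually weights and applies the two terms may differ.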
Pages: 12