GraphMFT: A graph network based multimodal fusion technique for emotion recognition in conversation

Cited by: 20
Authors
Li, Jiang [1,2,3]
Wang, Xiaoping [1,2,3]
Lv, Guoqing [1,2,3]
Zeng, Zhigang [1,2,3]
Affiliations
[1] Huazhong Univ Sci & Technol, Sch Artificial Intelligence & Automat, Wuhan 430074, Peoples R China
[2] Huazhong Univ Sci & Technol, Key Lab Image Proc & Intelligent Control, Educ Minist China, Wuhan 430074, Peoples R China
[3] Huazhong Univ Sci & Technol, Hubei Key Lab Brain Inspired Intelligent Syst, Wuhan 430074, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Multimodal machine learning; Graph neural networks; Emotion recognition in conversation; Multimodal fusion;
DOI
10.1016/j.neucom.2023.126427
CLC number
TP18 [Theory of Artificial Intelligence];
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Multimodal machine learning is an emerging area of research that has received a great deal of scholarly attention in recent years. To date, there have been few studies on multimodal Emotion Recognition in Conversation (ERC). Because Graph Neural Networks (GNNs) possess a powerful capacity for relational modeling, they have an inherent advantage in the field of multimodal learning. GNNs leverage a graph constructed from multimodal data to perform intra- and inter-modal information interaction, which effectively facilitates the integration and complementation of multimodal data. In this work, we propose a novel Graph network based Multimodal Fusion Technique (GraphMFT) for emotion recognition in conversation. Multimodal data can be modeled as a graph in which each data object is regarded as a node, and both the intra- and inter-modal dependencies between data objects are regarded as edges. GraphMFT utilizes multiple improved graph attention networks to capture intra-modal contextual information and inter-modal complementary information. In addition, the proposed GraphMFT attempts to address the challenges of existing graph-based multimodal conversational emotion recognition models such as MMGCN. Empirical results on two public multimodal datasets show that our model outperforms State-Of-The-Art (SOTA) approaches, with accuracies of 67.90% and 61.30%. © 2023 Elsevier B.V. All rights reserved.
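To make the graph construction described in the abstract concrete, the following minimal Python/PyTorch sketch (not the authors' released code) builds one node per (utterance, modality), adds intra-modal edges linking each node to its conversational context within a fixed window, adds inter-modal edges linking the modality views of the same utterance, and runs a single-head graph attention layer over the result. The feature size, window size, and the plain attention layer are illustrative assumptions; they do not reproduce the paper's "improved graph attention networks".

# Minimal sketch of a multimodal conversation graph, assuming:
# - three modalities per utterance (text, audio, visual),
# - a symmetric context window for intra-modal edges,
# - a simple single-head attention layer as a stand-in for the paper's GNN.
import torch
import torch.nn as nn
import torch.nn.functional as F

MODALITIES = ("text", "audio", "visual")

def build_multimodal_graph(num_utterances: int, window: int = 2):
    """Return (edge_index [2, E], node index map) for one conversation."""
    node_id = {(i, m): i * len(MODALITIES) + k
               for i in range(num_utterances)
               for k, m in enumerate(MODALITIES)}
    edges = []
    for i in range(num_utterances):
        # Intra-modal edges: utterance i to nearby utterances j in the same modality.
        for m in MODALITIES:
            for j in range(max(0, i - window), min(num_utterances, i + window + 1)):
                if j != i:
                    edges.append((node_id[(i, m)], node_id[(j, m)]))
        # Inter-modal edges: the different modality views of the same utterance i.
        for a in MODALITIES:
            for b in MODALITIES:
                if a != b:
                    edges.append((node_id[(i, a)], node_id[(i, b)]))
    return torch.tensor(edges, dtype=torch.long).t(), node_id

class SimpleGraphAttention(nn.Module):
    """Single-head graph attention layer (an illustrative stand-in, not GraphMFT's layer)."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.attn = nn.Linear(2 * dim, 1)

    def forward(self, x, edge_index):
        src, dst = edge_index
        h = self.proj(x)
        # Attention score for each edge from the concatenated endpoint features.
        scores = F.leaky_relu(self.attn(torch.cat([h[dst], h[src]], dim=-1))).squeeze(-1)
        # Normalize scores over the incoming edges of each destination node.
        alpha = torch.zeros_like(scores)
        for node in dst.unique():
            mask = dst == node
            alpha[mask] = F.softmax(scores[mask], dim=0)
        # Weighted aggregation of neighbor features, plus a residual connection.
        out = torch.zeros_like(h).index_add_(0, dst, alpha.unsqueeze(-1) * h[src])
        return F.relu(out + h)

# Usage: 5 utterances, 3 modality nodes each, 64-d features per node.
edge_index, _ = build_multimodal_graph(num_utterances=5)
x = torch.randn(5 * len(MODALITIES), 64)
layer = SimpleGraphAttention(64)
print(layer(x, edge_index).shape)  # torch.Size([15, 64])

Stacking several such layers lets intra-modal context and inter-modal complementary information propagate alternately through the graph, which is the role the abstract assigns to GraphMFT's multiple improved graph attention networks.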
Pages: 11