GraphMFT: A graph network based multimodal fusion technique for emotion recognition in conversation

Cited by: 20
Authors
Li, Jiang [1,2,3]
Wang, Xiaoping [1,2,3]
Lv, Guoqing [1,2,3]
Zeng, Zhigang [1,2,3]
Affiliations
[1] Huazhong Univ Sci & Technol, Sch Artificial Intelligence & Automat, Wuhan 430074, Peoples R China
[2] Huazhong Univ Sci & Technol, Key Lab Image Proc & Intelligent Control, Educ Minist China, Wuhan 430074, Peoples R China
[3] Huazhong Univ Sci & Technol, Hubei Key Lab Brain Inspired Intelligent Syst, Wuhan 430074, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Multimodal machine learning; Graph neural networks; Emotion recognition in conversation; Multimodal fusion;
DOI
10.1016/j.neucom.2023.126427
CLC number
TP18 [Theory of Artificial Intelligence];
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Multimodal machine learning is an emerging area of research that has received a great deal of scholarly attention in recent years. To date, there have been few studies on multimodal Emotion Recognition in Conversation (ERC). Because Graph Neural Networks (GNNs) possess a powerful capacity for relational modeling, they have an inherent advantage in the field of multimodal learning. GNNs leverage a graph constructed from multimodal data to perform intra- and inter-modal information interaction, which effectively facilitates the integration and complementation of multimodal data. In this work, we propose a novel Graph network based Multimodal Fusion Technique (GraphMFT) for emotion recognition in conversation. Multimodal data can be modeled as a graph in which each data object is regarded as a node, and both the intra- and inter-modal dependencies between data objects are regarded as edges. GraphMFT utilizes multiple improved graph attention networks to capture intra-modal contextual information and inter-modal complementary information. In addition, the proposed GraphMFT attempts to address the challenges of existing graph-based multimodal conversational emotion recognition models such as MMGCN. Empirical results on two public multimodal datasets show that our model outperforms State-Of-The-Art (SOTA) approaches, with accuracies of 67.90% and 61.30%. © 2023 Elsevier B.V. All rights reserved.
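To make the graph construction described in the abstract concrete, the following minimal Python/PyTorch sketch (not the authors' released code) builds one node per (utterance, modality), adds intra-modal edges linking each node to its conversational context within a fixed window, adds inter-modal edges linking the modality views of the same utterance, and runs a single-head graph attention layer over the result. The feature size, window size, and the plain attention layer are illustrative assumptions; they do not reproduce the paper's "improved graph attention networks".

# Minimal sketch of a multimodal conversation graph, assuming:
# - three modalities per utterance (text, audio, visual),
# - a symmetric context window for intra-modal edges,
# - a simple single-head attention layer as a stand-in for the paper's GNN.
import torch
import torch.nn as nn
import torch.nn.functional as F

MODALITIES = ("text", "audio", "visual")

def build_multimodal_graph(num_utterances: int, window: int = 2):
    """Return (edge_index [2, E], node index map) for one conversation."""
    node_id = {(i, m): i * len(MODALITIES) + k
               for i in range(num_utterances)
               for k, m in enumerate(MODALITIES)}
    edges = []
    for i in range(num_utterances):
        # Intra-modal edges: utterance i to nearby utterances j in the same modality.
        for m in MODALITIES:
            for j in range(max(0, i - window), min(num_utterances, i + window + 1)):
                if j != i:
                    edges.append((node_id[(i, m)], node_id[(j, m)]))
        # Inter-modal edges: the different modality views of the same utterance i.
        for a in MODALITIES:
            for b in MODALITIES:
                if a != b:
                    edges.append((node_id[(i, a)], node_id[(i, b)]))
    return torch.tensor(edges, dtype=torch.long).t(), node_id

class SimpleGraphAttention(nn.Module):
    """Single-head graph attention layer (an illustrative stand-in, not GraphMFT's layer)."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.attn = nn.Linear(2 * dim, 1)

    def forward(self, x, edge_index):
        src, dst = edge_index
        h = self.proj(x)
        # Attention score for each edge from the concatenated endpoint features.
        scores = F.leaky_relu(self.attn(torch.cat([h[dst], h[src]], dim=-1))).squeeze(-1)
        # Normalize scores over the incoming edges of each destination node.
        alpha = torch.zeros_like(scores)
        for node in dst.unique():
            mask = dst == node
            alpha[mask] = F.softmax(scores[mask], dim=0)
        # Weighted aggregation of neighbor features, plus a residual connection.
        out = torch.zeros_like(h).index_add_(0, dst, alpha.unsqueeze(-1) * h[src])
        return F.relu(out + h)

# Usage: 5 utterances, 3 modality nodes each, 64-d features per node.
edge_index, _ = build_multimodal_graph(num_utterances=5)
x = torch.randn(5 * len(MODALITIES), 64)
layer = SimpleGraphAttention(64)
print(layer(x, edge_index).shape)  # torch.Size([15, 64])

Stacking several such layers lets intra-modal context and inter-modal complementary information propagate alternately through the graph, which is the role the abstract assigns to GraphMFT's multiple improved graph attention networks.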
Pages: 11