Bi-stream graph learning based multimodal fusion for emotion recognition in conversation

Cited by: 3
Authors
Lu, Nannan [1 ]
Han, Zhiyuan [1 ]
Han, Min [2 ]
Qian, Jiansheng [1 ]
Affiliations
[1] China University of Mining and Technology, School of Information and Control Engineering, Xuzhou 221000, People's Republic of China
[2] Dalian University of Technology, School of Control Science and Engineering, Dalian, People's Republic of China
Funding
National Key R&D Program of China; National Natural Science Foundation of China;
Keywords
Emotion recognition in conversation; Multimodal fusion; Graph neural networks; Contextual information; Inter-modal interaction;
DOI
10.1016/j.inffus.2024.102272
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Emotion Recognition in Conversation (ERC) is the task of automatically detecting and understanding the emotions expressed in a conversation, and it plays an important role in human-computer interaction. A conversation generates data in different modalities, including words, tone of voice, and facial expressions. Multimodal ERC can fuse information from these multiple views to comprehensively model the emotion dynamics of a conversation. Graph Neural Networks (GNNs) are employed in multimodal ERC to learn intra-modal long-range contextual information and inter-modal interactions. However, fusing different modalities within a single graph may create conflicts among the multimodal information and suffer from the data heterogeneity issue. In this paper, we propose a novel Bi-stream Graph Learning based Multimodal Fusion (BiGMF) approach for ERC. It consists of a unimodal-stream graph learning branch that models intra-modal long-range contextual information and a cross-modal-stream graph learning branch that models inter-modal interactions, using GNNs to learn the intra- and inter-modal information in parallel. This separated learning scheme alleviates the conflict and heterogeneity in multimodal data fusion and promotes explicit modeling of cross-modal relations. Experimental results on two public datasets verify the superiority of the proposed approach over state-of-the-art approaches.
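The abstract describes two parallel GNN streams: one builds per-modality context graphs over utterances, the other builds a cross-modal graph linking the modalities of each utterance. The sketch below illustrates how such a bi-stream design could be wired up. It is a minimal illustration only, assuming plain dense graph convolutions, a fixed context window for intra-modal edges, and fully connected cross-modal edges per utterance; all names (BiStreamGraphFusion, GraphConv), dimensions, and the late-fusion classifier are our assumptions, not the authors' released implementation.

```python
# Minimal sketch of a bi-stream graph fusion model in the spirit of BiGMF.
# Hyperparameters, graph construction, and the fusion head are illustrative
# assumptions; they are NOT taken from the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphConv(nn.Module):
    """One dense graph-convolution layer: H' = ReLU(A_norm @ H @ W)."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # Symmetric normalization; adjacency already contains self-loops.
        deg = adj.sum(-1).clamp(min=1.0)
        d_inv_sqrt = deg.pow(-0.5)
        adj_norm = d_inv_sqrt.unsqueeze(-1) * adj * d_inv_sqrt.unsqueeze(-2)
        return F.relu(adj_norm @ self.linear(h))


def intra_modal_adj(num_utt: int, window: int = 4) -> torch.Tensor:
    """Connect each utterance to neighbors within a context window."""
    idx = torch.arange(num_utt)
    return (idx.unsqueeze(0) - idx.unsqueeze(1)).abs().le(window).float()


def cross_modal_adj(num_utt: int, num_mod: int) -> torch.Tensor:
    """Connect the nodes of the same utterance across modalities."""
    adj = torch.eye(num_utt * num_mod)
    for u in range(num_utt):
        nodes = [m * num_utt + u for m in range(num_mod)]
        for a in nodes:
            for b in nodes:
                adj[a, b] = 1.0
    return adj


class BiStreamGraphFusion(nn.Module):
    def __init__(self, dims=(100, 100, 100), hidden=64, num_classes=6):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, hidden) for d in dims])
        self.uni_gnn = GraphConv(hidden, hidden)    # intra-modal stream
        self.cross_gnn = GraphConv(hidden, hidden)  # inter-modal stream
        self.classifier = nn.Linear(2 * hidden * len(dims), num_classes)

    def forward(self, feats):
        # feats: list of (num_utt, dim) tensors, one per modality.
        num_utt = feats[0].size(0)
        h = [p(x) for p, x in zip(self.proj, feats)]

        # Stream 1: one context graph per modality, learned independently.
        a_intra = intra_modal_adj(num_utt)
        uni = [self.uni_gnn(hm, a_intra) for hm in h]

        # Stream 2: one graph whose edges link modalities per utterance.
        stacked = torch.cat(h, dim=0)  # (num_mod * num_utt, hidden)
        cross = self.cross_gnn(stacked, cross_modal_adj(num_utt, len(h)))
        cross = list(cross.chunk(len(h), dim=0))

        # Late fusion of both streams into per-utterance emotion logits.
        fused = torch.cat(uni + cross, dim=-1)
        return self.classifier(fused)


if __name__ == "__main__":
    text = torch.randn(8, 100)   # 8 utterances, textual features
    audio = torch.randn(8, 100)  # acoustic features
    video = torch.randn(8, 100)  # visual features
    logits = BiStreamGraphFusion()([text, audio, video])
    print(logits.shape)  # torch.Size([8, 6])
```

Running the script prints torch.Size([8, 6]), i.e., one 6-way emotion logit vector per utterance. Keeping the two streams as separate GNNs that meet only at the classifier mirrors the abstract's claim that separating intra- and inter-modal learning avoids mixing heterogeneous modalities in a single graph.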
Pages: 13