Bi-stream graph learning based multimodal fusion for emotion recognition in conversation

Cited by: 3
Authors
Lu, Nannan [1 ]
Han, Zhiyuan [1 ]
Han, Min [2 ]
Qian, Jiansheng [1 ]
Affiliations
[1] China University of Mining and Technology, School of Information and Control Engineering, Xuzhou 221000, People's Republic of China
[2] Dalian University of Technology, School of Control Science and Engineering, Dalian, People's Republic of China
Funding
National Key R&D Program of China; National Natural Science Foundation of China;
Keywords
Emotion recognition in conversation; Multimodal fusion; Graph neural networks; Contextual information; Inter-modal interaction;
DOI
10.1016/j.inffus.2024.102272
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Emotion Recognition in Conversation (ERC) is the task of automatically detecting and understanding the emotions expressed in a conversation, and it plays an important role in human-computer interaction. A conversation generates data in different modalities, including words, tone of voice, and facial expressions. Multimodal ERC can fuse information from these multiple views to comprehensively model the emotion dynamics of a conversation. Graph Neural Networks (GNNs) are employed in multimodal ERC to learn intra-modal long-range contextual information and inter-modal interactions. However, fusing different modalities within a single graph may create conflicts among the multimodal information and suffer from the data heterogeneity issue. In this paper, we propose a novel Bi-stream Graph Learning based Multimodal Fusion (BiGMF) approach for ERC. It consists of a unimodal-stream graph learning branch that models intra-modal long-range contextual information and a cross-modal-stream graph learning branch that models inter-modal interactions, using GNNs to learn the intra- and inter-modal information in parallel. This separated learning scheme alleviates the conflict and heterogeneity in multimodal data fusion and promotes explicit modeling of cross-modal relations. Experimental results on two public datasets verify the superiority of the proposed approach over state-of-the-art approaches.
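The abstract describes two parallel GNN streams: one builds per-modality context graphs over utterances, the other builds a cross-modal graph linking the modalities of each utterance. The sketch below illustrates how such a bi-stream design could be wired up. It is a minimal illustration only, assuming plain dense graph convolutions, a fixed context window for intra-modal edges, and fully connected cross-modal edges per utterance; all names (BiStreamGraphFusion, GraphConv), dimensions, and the late-fusion classifier are our assumptions, not the authors' released implementation.

```python
# Minimal sketch of a bi-stream graph fusion model in the spirit of BiGMF.
# Hyperparameters, graph construction, and the fusion head are illustrative
# assumptions; they are NOT taken from the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphConv(nn.Module):
    """One dense graph-convolution layer: H' = ReLU(A_norm @ H @ W)."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # Symmetric normalization; adjacency already contains self-loops.
        deg = adj.sum(-1).clamp(min=1.0)
        d_inv_sqrt = deg.pow(-0.5)
        adj_norm = d_inv_sqrt.unsqueeze(-1) * adj * d_inv_sqrt.unsqueeze(-2)
        return F.relu(adj_norm @ self.linear(h))


def intra_modal_adj(num_utt: int, window: int = 4) -> torch.Tensor:
    """Connect each utterance to neighbors within a context window."""
    idx = torch.arange(num_utt)
    return (idx.unsqueeze(0) - idx.unsqueeze(1)).abs().le(window).float()


def cross_modal_adj(num_utt: int, num_mod: int) -> torch.Tensor:
    """Connect the nodes of the same utterance across modalities."""
    adj = torch.eye(num_utt * num_mod)
    for u in range(num_utt):
        nodes = [m * num_utt + u for m in range(num_mod)]
        for a in nodes:
            for b in nodes:
                adj[a, b] = 1.0
    return adj


class BiStreamGraphFusion(nn.Module):
    def __init__(self, dims=(100, 100, 100), hidden=64, num_classes=6):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, hidden) for d in dims])
        self.uni_gnn = GraphConv(hidden, hidden)    # intra-modal stream
        self.cross_gnn = GraphConv(hidden, hidden)  # inter-modal stream
        self.classifier = nn.Linear(2 * hidden * len(dims), num_classes)

    def forward(self, feats):
        # feats: list of (num_utt, dim) tensors, one per modality.
        num_utt = feats[0].size(0)
        h = [p(x) for p, x in zip(self.proj, feats)]

        # Stream 1: one context graph per modality, learned independently.
        a_intra = intra_modal_adj(num_utt)
        uni = [self.uni_gnn(hm, a_intra) for hm in h]

        # Stream 2: one graph whose edges link modalities per utterance.
        stacked = torch.cat(h, dim=0)  # (num_mod * num_utt, hidden)
        cross = self.cross_gnn(stacked, cross_modal_adj(num_utt, len(h)))
        cross = list(cross.chunk(len(h), dim=0))

        # Late fusion of both streams into per-utterance emotion logits.
        fused = torch.cat(uni + cross, dim=-1)
        return self.classifier(fused)


if __name__ == "__main__":
    text = torch.randn(8, 100)   # 8 utterances, textual features
    audio = torch.randn(8, 100)  # acoustic features
    video = torch.randn(8, 100)  # visual features
    logits = BiStreamGraphFusion()([text, audio, video])
    print(logits.shape)  # torch.Size([8, 6])
```

Running the script prints torch.Size([8, 6]), i.e., one 6-way emotion logit vector per utterance. Keeping the two streams as separate GNNs that meet only at the classifier mirrors the abstract's claim that separating intra- and inter-modal learning avoids mixing heterogeneous modalities in a single graph.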
Pages: 13