Topics Guided Multimodal Fusion Network for Conversational Emotion Recognition

Citations: 0
Authors
Yuan, Peicong [1 ]
Cai, Guoyong [1 ]
Chen, Ming [1 ]
Tang, Xiaolv [1 ]
Affiliations
[1] Guilin Univ Elect Technol, Guilin, Peoples R China
Source
ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT III, ICIC 2024 | 2024, Vol. 14877
Keywords
Emotion Recognition in Conversation; Neural Topic Model; Multimodal Fusion
DOI
10.1007/978-981-97-5669-8_21
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Emotion Recognition in Conversation (ERC) is a challenging task. Previous methods capture the semantic dependencies between utterances through complex conversational context modeling but ignore the topic information carried by the utterances; moreover, the commonality of multimodal information has not been effectively explored. To this end, the Topics Guided Multimodal Fusion Network (TGMFN) is proposed to extract effective utterance topic information and to exploit cross-modal commonality and complementarity. First, a VAE-based neural topic model is used to build a conversational topic model, with a new topic sampling strategy that differs from the traditional reparameterization trick and makes the topic modeling better suited to utterances. Second, a facial feature extraction method for multi-party conversations is proposed to extract rich facial features from video. Finally, the Topic-Guided Vision-Audio features Aware fusion (TGV2A) module is designed around the conversation topic; it fuses modality-specific information such as the speaker's facial features and topic-related co-occurrence information, and captures the commonality and complementarity between modalities to enrich the semantics of the fused features. Extensive experiments on two multimodal ERC datasets, IEMOCAP and MELD, show that the proposed TGMFN outperforms leading baseline methods.
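For context, the conventional VAE-based neural topic model that TGMFN builds on encodes each utterance as a bag of words, infers a Gaussian posterior, and draws a latent topic vector with the reparameterization trick, which is exactly the step the paper replaces with its own sampling strategy. Below is a minimal PyTorch sketch of that standard baseline, assuming bag-of-words input; all class and variable names are illustrative, and the paper's modified sampler is not reproduced here.

```python
# Minimal sketch of a standard VAE-based neural topic model (the baseline
# whose reparameterization-trick sampling TGMFN modifies). Names are
# illustrative, not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralTopicModel(nn.Module):
    def __init__(self, vocab_size: int, num_topics: int, hidden: int = 256):
        super().__init__()
        # Encoder: bag-of-words utterance -> Gaussian posterior parameters.
        self.encoder = nn.Sequential(nn.Linear(vocab_size, hidden), nn.Softplus())
        self.mu = nn.Linear(hidden, num_topics)
        self.logvar = nn.Linear(hidden, num_topics)
        # Decoder: topic mixture -> word distribution over the vocabulary.
        self.decoder = nn.Linear(num_topics, vocab_size)

    def forward(self, bow: torch.Tensor):
        h = self.encoder(bow)
        mu, logvar = self.mu(h), self.logvar(h)
        # Standard reparameterization trick: z = mu + sigma * eps.
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps
        theta = F.softmax(z, dim=-1)                     # topic proportions
        recon = F.log_softmax(self.decoder(theta), dim=-1)
        # Negative ELBO = reconstruction loss + KL to a standard normal prior.
        rec_loss = -(bow * recon).sum(-1).mean()
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return theta, rec_loss + kl
```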
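The record does not spell out the internals of TGV2A; one plausible reading of "topic-guided" fusion is to let the inferred topic proportions act as an attention query that selects topic-relevant visual and acoustic cues. The sketch below is a hypothetical illustration under that assumption, not the authors' implementation; TopicGuidedFusion and all parameter names are invented for the example.

```python
# Hypothetical topic-guided vision-audio fusion in the spirit of TGV2A;
# the actual module design is not specified in this record.
import torch
import torch.nn as nn

class TopicGuidedFusion(nn.Module):
    def __init__(self, num_topics: int, dim: int = 256, heads: int = 4):
        super().__init__()
        self.topic_proj = nn.Linear(num_topics, dim)  # topic vector -> query space
        self.attn_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.out = nn.Linear(2 * dim, dim)

    def forward(self, theta, vision, audio):
        # theta: (B, K) topic proportions; vision/audio: (B, T, dim) sequences.
        q = self.topic_proj(theta).unsqueeze(1)        # (B, 1, dim) topic query
        v_ctx, _ = self.attn_v(q, vision, vision)      # topic-relevant visual cues
        a_ctx, _ = self.attn_a(q, audio, audio)        # topic-relevant acoustic cues
        fused = torch.cat([v_ctx, a_ctx], dim=-1).squeeze(1)
        return self.out(fused)                         # fused multimodal feature
```

In this reading, the topic vector determines which face and voice frames matter, which matches the abstract's claim that fusion is conditioned on the conversation topic; the authors' actual module may differ substantially.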
Pages: 250-262
Number of pages: 13