SMFNM: Semi-supervised multimodal fusion network with main-modal for real-time emotion recognition in conversations

Cited by: 8
Authors
Yang, Juan [1 ]
Dong, Xuanxiong [1 ]
Du, Xu [2 ]
Affiliations
[1] Wuhan Univ Sci & Technol, Coll Comp Sci & Technol, Wuhan 430065, Hubei, Peoples R China
[2] Cent China Normal Univ, Natl Engn Res Ctr E Learning, Wuhan 430079, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Real-time emotion recognition in conversations; Semi-supervised learning; Main modal; Multimodal interaction; Multimodal fusion network; LANGUAGE; SPEECH; SELECTION
DOI
10.1016/j.jksuci.2023.101791
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline code
0812
Abstract
Real-time emotion recognition in conversations (ERC), which relies only on historical utterances to achieve ERC, has recently gained increasing attention due to its significance in providing real-time empathetic services. Although utilizing multimodal information can mitigate the issues of unimodal approaches, few real-time ERC studies consider the differences in representation ability across modalities or explore comprehensive conversational context from different perspectives based on different structures. Furthermore, the heavy annotation cost makes it difficult to collect sufficient labeled data, which also limits the performance of current supervised ERC approaches. To address these issues, we propose a novel framework, SMFNM, for real-time ERC, which integrates semi-supervised learning with multimodal fusion under the guidance of a main modal. Specifically, SMFNM utilizes additional unlabeled data to extract high-quality intra-modal representations, and implements cross-modal interaction to capture complementary information that enhances the audio representations. SMFNM then employs a directed acyclic graph and Gated Recurrent Units to explore more accurate conversational context from the multimodal and main-modal perspectives, respectively. Finally, these two types of contextual features are fused for emotion identification. Extensive experiments on benchmark datasets (i.e., IEMOCAP (4-way), IEMOCAP (6-way) and MELD) demonstrate the effectiveness, superiority and rationality of our SMFNM.
(c) 2023 The Author(s). Published by Elsevier B.V. on behalf of King Saud University. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
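The abstract describes a two-branch design whose final stage fuses contextual features from a multimodal encoder and from a main-modal encoder before classification. The sketch below illustrates only that last fusion-and-classify step with a toy concatenation-plus-linear classifier; all function names, feature dimensions, and weight values are illustrative assumptions, not the paper's actual implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_and_classify(ctx_multimodal, ctx_main_modal, weights, bias):
    """Hypothetical fusion stage: concatenate the two contextual feature
    vectors, apply a linear layer, and return class probabilities."""
    fused = ctx_multimodal + ctx_main_modal  # feature concatenation
    logits = [
        sum(f * weights[i][j] for i, f in enumerate(fused)) + bias[j]
        for j in range(len(bias))
    ]
    return softmax(logits)

# Toy inputs: stand-ins for DAG-based multimodal context and GRU-based
# main-modal context (real dimensions would be much larger).
ctx_mm = [0.2, -0.1, 0.4]
ctx_main = [0.3, 0.0, -0.2]
n_classes = 4  # e.g., the IEMOCAP 4-way setting mentioned in the abstract
weights = [[0.1 * ((i + j) % 3 - 1) for j in range(n_classes)] for i in range(6)]
bias = [0.0] * n_classes

probs = fuse_and_classify(ctx_mm, ctx_main, weights, bias)
pred = max(range(n_classes), key=lambda j: probs[j])
print(pred, probs)
```

In the paper's full pipeline the two context vectors would come from trained encoders and the classifier would be learned; this toy version only shows the data flow of the fusion step.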
Pages: 16