M2FNet: Multi-modal Fusion Network for Emotion Recognition in Conversation

Cited by: 89
Authors
Chudasama, Vishal [1 ]
Kar, Purbayan [1 ]
Gudmalwar, Ashish [1 ]
Shah, Nirmesh [1 ]
Wasnik, Pankaj [1 ]
Onoe, Naoyuki [1 ]
Affiliations
[1] Sony Research India, Media Analysis Group, Bangalore, Karnataka, India
Source
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW 2022) | 2022
Keywords
DOI
10.1109/CVPRW56347.2022.00511
Chinese Library Classification (CLC)
TP301 [Theory and Methods];
Discipline Code
081202;
Abstract
Emotion Recognition in Conversations (ERC) is crucial in developing sympathetic human-machine interaction. In conversational videos, emotion can be present in multiple modalities, i.e., audio, video, and transcript. However, due to the inherent characteristics of these modalities, multi-modal ERC has always been considered a challenging undertaking. Existing ERC research focuses mainly on the textual information in a discussion, ignoring the other two modalities. We anticipate that emotion recognition accuracy can be improved by employing a multi-modal approach. Thus, in this study, we propose a Multi-modal Fusion Network (M2FNet) that extracts emotion-relevant features from the visual, audio, and text modalities. It employs a multi-head attention-based fusion mechanism to combine emotion-rich latent representations of the input data. We introduce a new feature extractor to obtain latent features from the audio and visual modalities. The proposed feature extractor is trained with a novel adaptive margin-based triplet loss function to learn emotion-relevant features from the audio and visual data. In the domain of ERC, existing methods perform well on one benchmark dataset but not on others. Our results show that the proposed M2FNet architecture outperforms all other methods in terms of weighted average F1 score on the well-known MELD and IEMOCAP datasets and sets a new state-of-the-art performance in ERC.
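
The abstract names two components without giving their formulation: a multi-head attention-based fusion of modality features and an adaptive margin-based triplet loss. The PyTorch sketch below only illustrates the general idea behind such components; the names (FusionBlock, adaptive_margin_triplet_loss), the dimensions, and the particular margin rule are assumptions for illustration, not the authors' implementation.

# Illustrative sketch only; not the M2FNet reference code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusionBlock(nn.Module):
    """Multi-head attention fusion: text features attend over audio/visual features."""

    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        # text:  (batch, seq_len, d_model) utterance-level text features
        # other: (batch, seq_len, d_model) audio or visual features
        fused, _ = self.attn(query=text, key=other, value=other)
        return self.norm(text + fused)  # residual connection


def adaptive_margin_triplet_loss(anchor, positive, negative, scale: float = 1.0):
    """Triplet loss whose margin adapts to the current anchor-positive and
    anchor-negative distances instead of using one fixed margin (assumed rule)."""
    d_ap = F.pairwise_distance(anchor, positive)
    d_an = F.pairwise_distance(anchor, negative)
    margin = scale * torch.sigmoid(d_ap - d_an)  # assumed adaptive-margin choice
    return F.relu(d_ap - d_an + margin).mean()


if __name__ == "__main__":
    # Fuse text with audio features, pool to utterance embeddings, compute the loss.
    fusion = FusionBlock()
    text = torch.randn(4, 10, 256)
    audio = torch.randn(4, 10, 256)
    fused = fusion(text, audio)           # (4, 10, 256)
    emb = fused.mean(dim=1)               # (4, 256) utterance embeddings
    loss = adaptive_margin_triplet_loss(emb, emb.roll(1, 0), torch.randn(4, 256))
    print(fused.shape, loss.item())
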
Pages: 4651-4660
Number of pages: 10