M2FNet: Multi-modal Fusion Network for Emotion Recognition in Conversation

Cited by: 52
Authors
Chudasama, Vishal [1 ]
Kar, Purbayan [1 ]
Gudmalwar, Ashish [1 ]
Shah, Nirmesh [1 ]
Wasnik, Pankaj [1 ]
Onoe, Naoyuki [1 ]
Affiliations
[1] Sony Research India, Media Analysis Group, Bangalore, Karnataka, India
Source
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS (CVPRW 2022), 2022
DOI
10.1109/CVPRW56347.2022.00511
Chinese Library Classification (CLC)
TP301 [Theory and Methods];
Discipline Code
081202;
Abstract
Emotion Recognition in Conversations (ERC) is crucial for developing empathetic human-machine interaction. In conversational videos, emotion can be present in multiple modalities, i.e., audio, video, and transcript. However, due to the inherent characteristics of these modalities, multi-modal ERC has always been considered a challenging undertaking. Existing ERC research focuses mainly on the text in a conversation, ignoring the other two modalities. We anticipate that emotion recognition accuracy can be improved by employing a multi-modal approach. Thus, in this study, we propose a Multi-modal Fusion Network (M2FNet) that extracts emotion-relevant features from the visual, audio, and text modalities. It employs a multi-head attention-based fusion mechanism to combine emotion-rich latent representations of the input data. We introduce a new feature extractor for the audio and visual modalities, trained with a novel adaptive margin-based triplet loss function to learn emotion-relevant features. In the ERC domain, existing methods perform well on one benchmark dataset but not on others. Our results show that the proposed M2FNet architecture outperforms all other methods in terms of weighted average F1 score on the well-known MELD and IEMOCAP datasets, setting a new state of the art in ERC.
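The abstract names two trainable components: an adaptive margin-based triplet loss for the audio/visual feature extractor, and multi-head attention that fuses the per-modality representations. The PyTorch sketch below is a minimal, hypothetical rendering of those two ideas for orientation only; the module names, the sigmoid-based margin schedule, the choice of text features as the attention query, and all hyper-parameters are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of the two ideas named in the M2FNet abstract.
# Names, margin schedule, and hyper-parameters are assumptions, not the
# authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveMarginTripletLoss(nn.Module):
    """Triplet loss whose margin widens for hard negatives (one plausible
    'adaptive margin' scheme; the paper's exact formulation may differ)."""

    def __init__(self, base_margin: float = 0.2, scale: float = 1.0):
        super().__init__()
        self.base_margin = base_margin
        self.scale = scale

    def forward(self, anchor, positive, negative):
        d_ap = F.pairwise_distance(anchor, positive)  # anchor-positive distance
        d_an = F.pairwise_distance(anchor, negative)  # anchor-negative distance
        # Widen the margin when the negative is nearly as close as the
        # positive, i.e., when d_an approaches d_ap.
        margin = self.base_margin + self.scale * torch.sigmoid(d_ap - d_an)
        return F.relu(d_ap - d_an + margin).mean()


class MultiHeadAttentionFusion(nn.Module):
    """Fuses modalities with cross-attention: text features query the
    concatenated audio+visual sequence, with a residual connection."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text, audio, visual):
        context = torch.cat([audio, visual], dim=1)  # (B, La+Lv, dim)
        fused, _ = self.attn(query=text, key=context, value=context)
        return self.norm(text + fused)               # residual + layer norm


# Usage with random stand-in features (batch=4, arbitrary sequence lengths):
if __name__ == "__main__":
    B, D = 4, 256
    text = torch.randn(B, 10, D)
    audio = torch.randn(B, 20, D)
    visual = torch.randn(B, 15, D)
    fusion = MultiHeadAttentionFusion(D, 8)
    print(fusion(text, audio, visual).shape)  # torch.Size([4, 10, 256])

    loss_fn = AdaptiveMarginTripletLoss()
    a, p, n = torch.randn(B, D), torch.randn(B, D), torch.randn(B, D)
    print(loss_fn(a, p, n))                   # scalar loss tensor
```

Anchoring the query on text and attending over audio and visual follows the abstract's emphasis on text as the dominant ERC modality; it is one reasonable fusion layout, not a claim about M2FNet's exact wiring.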
Pages: 4651-4660
Page count: 10
Related Papers (showing 10 of 50)
• [1] Jiang, Chenchen; Ren, Huazhong; Yang, Hong; Huo, Hongtao; Zhu, Pengfei; Yao, Zhaoyuan; Li, Jing; Sun, Min; Yang, Shihao. M2FNet: Multi-modal fusion network for object detection from visible and thermal infrared images. INTERNATIONAL JOURNAL OF APPLIED EARTH OBSERVATION AND GEOINFORMATION, 2024, 130.
• [2] Lu, Yawen; Huang, Yunhan; Sun, Su; Zhang, Tansi; Zhang, Xuewen; Fei, Songlin; Chen, Victor. M2fNet: Multi-modal Forest Monitoring Network on Large-scale Virtual Dataset. 2024 IEEE CONFERENCE ON VIRTUAL REALITY AND 3D USER INTERFACES ABSTRACTS AND WORKSHOPS (VRW 2024), 2024: 539-543.
• [3] Liu Q.; Xie J.; Hu Y.; Hao S.-F.; Hao Y.-H. Listening and speaking knowledge fusion network for multi-modal emotion recognition in conversation. Kongzhi yu Juece/Control and Decision, 2024, 39 (06): 2031-2040.
• [4] Liu, Shuai; Gao, Peng; Li, Yating; Fu, Weina; Ding, Weiping. Multi-modal fusion network with complementarity and importance for emotion recognition. INFORMATION SCIENCES, 2023, 619: 679-694.
• [5] Wang, He; Pan, Haiwei; Zhang, Kejia; He, Shuning; Chen, Chunling. M2FNet: Multi-granularity Feature Fusion Network for Medical Visual Question Answering. PRICAI 2022: TRENDS IN ARTIFICIAL INTELLIGENCE, PT II, 2022, 13630: 141-154.
• [6] Priyasad, Darshana; Fernando, Tharindu; Denman, Simon; Sridharan, Sridha; Fookes, Clinton. Attention Driven Fusion for Multi-modal Emotion Recognition. 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020: 3227-3231.
• [7] Ren, Minjie; Nie, Weizhi; Liu, Anan; Su, Yuting. Multi-modal Correlated Network for emotion recognition in speech. VISUAL INFORMATICS, 2019, 3 (03): 150-155.
• [8] Hou, Mixiao; Zhang, Zheng; Liu, Chang; Lu, Guangming. Semantic Alignment Network for Multi-Modal Emotion Recognition. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (09): 5318-5329.
• [9] Xu, Chao; Cao, Tianyi; Feng, Zhiyong; Dong, Caichao. Multi-Modal Fusion Emotion Recognition Based on HMM and ANN. CONTEMPORARY RESEARCH ON E-BUSINESS TECHNOLOGY AND STRATEGY, 2012, 332: 541-550.
• [10] Wu, Yuezhou; Zhang, Siling; Li, Pengfei. Multi-modal emotion recognition in conversation based on prompt learning with text-audio fusion features. SCIENTIFIC REPORTS, 2025, 15 (01).