Multimodal emotion recognition based on audio and text by using hybrid attention networks

Cited: 25
Authors
Zhang, Shiqing [1]
Yang, Yijiao [1,2]
Chen, Chen [1]
Liu, Ruixin [1,2]
Tao, Xin [1]
Guo, Wenping [1]
Xu, Yicheng [3]
Zhao, Xiaoming [1]
Affiliations
[1] Taizhou Univ, Inst Intelligent Informat Proc, Taizhou 318000, Zhejiang, Peoples R China
[2] Zhejiang Univ Sci & Technol, Sch Sci, Hangzhou 310023, Zhejiang, Peoples R China
[3] Taizhou Vocat & Tech Coll, Sch Informat Technol Engn, Taizhou 318000, Zhejiang, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Multimodal emotion recognition; Deep learning; Local intra-modal attention; Cross-modal attention; Global inter-modal attention; NEURAL-NETWORKS; SPEECH; FEATURES;
DOI
10.1016/j.bspc.2023.105052
CLC Number
R318 [Biomedical Engineering]
Discipline Code
0831
Abstract
Multimodal Emotion Recognition (MER) has recently become a popular and challenging topic. The key challenge in MER is how to effectively fuse multimodal information. Most prior works do not fully exploit inter-modal and intra-modal attention mechanisms to jointly learn the emotionally salient information within and across modalities, which limits MER performance. To address this problem, this paper proposes a new MER framework based on audio and text using Hybrid Attention Networks (MER-HAN). The proposed MER-HAN combines three different attention mechanisms, namely local intra-modal attention, cross-modal attention, and global inter-modal attention, to effectively learn intra-modal and inter-modal emotionally salient features for MER. Specifically, an Audio and Text Encoder (ATE) block, built on deep learning techniques with a local intra-modal attention mechanism, first learns high-level audio and text feature representations from the corresponding audio and text sequences. A Cross-Modal Attention (CMA) block then jointly captures high-level shared feature representations across the audio and text modalities. Finally, a Multimodal Emotion Classification (MEC) block with a global inter-modal attention mechanism produces the final MER results. Extensive experiments on two public multimodal emotion datasets, IEMOCAP and MELD, demonstrate the advantage of the proposed MER-HAN on MER tasks.
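The record carries no code, but the abstract outlines a three-stage pipeline (ATE, CMA, MEC) whose middle stage is standard enough to illustrate. Below is a minimal PyTorch-style sketch of the cross-modal attention idea, in which each modality's features query the other's. The class name, dimensions, head count, and residual wiring are illustrative assumptions, not the authors' implementation; the paper's exact architecture and hyperparameters are not reproduced here.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Sketch of a cross-modal attention block: each modality attends
    over the other. Layer choices are assumptions, not the MER-HAN
    configuration from the paper."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # Audio queries attend over text keys/values, and vice versa.
        self.audio_to_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.text_to_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, audio: torch.Tensor, text: torch.Tensor):
        # audio: (batch, T_audio, dim); text: (batch, T_text, dim)
        audio_ctx, _ = self.audio_to_text(query=audio, key=text, value=text)
        text_ctx, _ = self.text_to_audio(query=text, key=audio, value=audio)
        # Residual connections preserve each modality's intra-modal features.
        return audio + audio_ctx, text + text_ctx

# Toy usage: random features standing in for ATE encoder outputs.
a = torch.randn(2, 100, 256)  # e.g., frame-level audio representations
t = torch.randn(2, 30, 256)   # e.g., token-level text representations
fused_audio, fused_text = CrossModalAttention()(a, t)
```

In this sketch the fused outputs would then feed a pooling-and-classification stage playing the role of the MEC block; how MER-HAN actually weights the two modalities globally is described only at the level of the abstract above.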
Pages: 10
Related Papers
50 items in total
  • [21] Deep learning based multimodal emotion recognition using model-level fusion of audio-visual modalities
    Middya, Asif Iqbal
    Nag, Baibhav
    Roy, Sarbani
    KNOWLEDGE-BASED SYSTEMS, 2022, 244
  • [23] AIA-Net: Adaptive Interactive Attention Network for Text-Audio Emotion Recognition
    Zhang, Tong
    Li, Shuzhen
    Chen, Bianna
    Yuan, Haozhang
    Chen, C. L. Philip
    IEEE TRANSACTIONS ON CYBERNETICS, 2023, 53 (12) : 7659 - 7671
  • [24] Coordination Attention based Transformers with bidirectional contrastive loss for multimodal speech emotion recognition
    Fan, Weiquan
    Xu, Xiangmin
    Zhou, Guohua
    Deng, Xiaofang
    Xing, Xiaofen
    SPEECH COMMUNICATION, 2025, 169
  • [25] Using the Fisher Vector Representation for Audio-based Emotion Recognition
    Gosztolya, Gabor
    ACTA POLYTECHNICA HUNGARICA, 2020, 17 (06) : 7 - 23
  • [26] Audio-Visual Fusion Network Based on Conformer for Multimodal Emotion Recognition
    Guo, Peini
    Chen, Zhengyan
    Li, Yidi
    Liu, Hong
    ARTIFICIAL INTELLIGENCE, CICAI 2022, PT II, 2022, 13605 : 315 - 326
  • [27] Audio-Visual Emotion Recognition Using a Hybrid Deep Convolutional Neural Network based on Census Transform
    Cornejo, Jadisha Yarif Ramirez
    Pedrini, Helio
    2019 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN AND CYBERNETICS (SMC), 2019, : 3396 - 3402
  • [28] Multimodal Emotion Recognition Based on Facial Expressions, Speech, and EEG
    Pan, Jiahui
    Fang, Weijie
    Zhang, Zhihang
    Chen, Bingzhi
    Zhang, Zheng
    Wang, Shuihua
    IEEE OPEN JOURNAL OF ENGINEERING IN MEDICINE AND BIOLOGY, 2024, 5 : 396 - 403
  • [29] Multimodal Emotion Recognition via Convolutional Neural Networks: Comparison of different strategies on two multimodal datasets
    Bilotti, U.
    Bisogni, C.
    De Marsico, M.
    Tramonte, S.
    ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2024, 130
  • [30] Adolescent Depression Detection Model Based on Multimodal Data of Interview Audio and Text
    Zhang, Lei
    Fan, Yuanxiao
    Jiang, Jingwen
    Li, Yuchen
    Zhang, Wei
    INTERNATIONAL JOURNAL OF NEURAL SYSTEMS, 2022, 32 (11)