Multimodal emotion recognition based on audio and text by using hybrid attention networks

Cited by: 25
Authors
Zhang, Shiqing [1]
Yang, Yijiao [1,2]
Chen, Chen [1]
Liu, Ruixin [1,2]
Tao, Xin [1]
Guo, Wenping [1]
Xu, Yicheng [3]
Zhao, Xiaoming [1]
Affiliations
[1] Taizhou Univ, Inst Intelligent Informat Proc, Taizhou 318000, Zhejiang, Peoples R China
[2] Zhejiang Univ Sci & Technol, Sch Sci, Hangzhou 310023, Zhejiang, Peoples R China
[3] Taizhou Vocat & Tech Coll, Sch Informat Technol Engn, Taizhou 318000, Zhejiang, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Multimodal emotion recognition; Deep learning; Local intra-modal attention; Cross-modal attention; Global inter-modal attention; Neural networks; Speech; Features
DOI
10.1016/j.bspc.2023.105052
CLC number
R318 [Biomedical Engineering]
Discipline classification code
0831
Abstract
Multimodal Emotion Recognition (MER) has recently become a popular yet challenging topic. The key challenge in MER is how to fuse multimodal information effectively. Most prior works do not fully exploit inter-modal and intra-modal attention mechanisms to jointly learn intra-modal and inter-modal emotionally salient information, leaving room to improve MER performance. To address this problem, this paper proposes a new MER framework based on audio and text using Hybrid Attention Networks (MER-HAN). The proposed MER-HAN combines three different attention mechanisms, namely local intra-modal attention, cross-modal attention, and global inter-modal attention, to effectively learn intra-modal and inter-modal emotionally salient features for MER. Specifically, an Audio and Text Encoder (ATE) block, which couples deep learning techniques with a local intra-modal attention mechanism, is first designed to learn high-level audio and text feature representations from the corresponding audio and text sequences. Then, a Cross-Modal Attention (CMA) block jointly captures high-level shared feature representations across the audio and text modalities. Finally, a Multimodal Emotion Classification (MEC) block with a global inter-modal attention mechanism produces the final MER results. Extensive experiments on two public multimodal emotion datasets, the IEMOCAP and MELD datasets, demonstrate the advantage of the proposed MER-HAN on MER tasks.
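To make the three-stage attention pipeline in the abstract concrete, the following minimal PyTorch sketch mirrors its structure: local intra-modal attention within each modality (the ATE stage, with the underlying encoders omitted), symmetric cross-modal attention between audio and text (the CMA stage), and global inter-modal attention over pooled modality vectors before classification (the MEC stage). All module names, feature dimensions, pooling choices, and the symmetric two-way attention design are illustrative assumptions, not the authors' released implementation.

```python
# A minimal sketch of a hybrid attention pipeline like the one described in
# the abstract. Every name and dimension below is an assumption for
# illustration; this is not the MER-HAN reference code.
import torch
import torch.nn as nn

class LocalIntraModalAttention(nn.Module):
    """Re-weights time steps within one modality (assumed form)."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                        # x: (batch, seq_len, dim)
        w = torch.softmax(self.score(x), dim=1)  # attention over time steps
        return x * w                             # emotionally re-weighted sequence

class CrossModalAttention(nn.Module):
    """Audio attends to text and text attends to audio (assumed symmetric design)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.a2t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.t2a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio, text):
        shared_a, _ = self.a2t(audio, text, text)   # audio queries text
        shared_t, _ = self.t2a(text, audio, audio)  # text queries audio
        return shared_a, shared_t

class GlobalInterModalAttention(nn.Module):
    """Weights each modality's pooled vector before fusion (assumed form)."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, vecs):                     # vecs: (batch, n_modal, dim)
        w = torch.softmax(self.score(vecs), dim=1)
        return (vecs * w).sum(dim=1)             # fused vector: (batch, dim)

class MERHANSketch(nn.Module):
    """Hypothetical end-to-end skeleton chaining the three attention stages."""
    def __init__(self, dim=128, n_classes=4):
        super().__init__()
        self.intra_a = LocalIntraModalAttention(dim)
        self.intra_t = LocalIntraModalAttention(dim)
        self.cma = CrossModalAttention(dim)
        self.inter = GlobalInterModalAttention(dim)
        self.cls = nn.Linear(dim, n_classes)

    def forward(self, audio_feats, text_feats):
        a = self.intra_a(audio_feats)            # ATE stage (encoders omitted)
        t = self.intra_t(text_feats)
        sa, st = self.cma(a, t)                  # CMA stage: shared features
        pooled = torch.stack([sa.mean(1), st.mean(1)], dim=1)
        fused = self.inter(pooled)               # MEC stage: global fusion
        return self.cls(fused)                   # emotion logits

# Usage with dummy inputs: 100 audio frames and 30 text tokens per sample.
model = MERHANSketch()
logits = model(torch.randn(2, 100, 128), torch.randn(2, 30, 128))
print(logits.shape)  # torch.Size([2, 4])
```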
Pages: 10
Related papers
50 records in total
  • [31] Bhaskar, Jasmine; Sruthi, K.; Nedungadi, Prema. Hybrid Approach for Emotion Classification of Audio Conversation Based on Text and Speech Mining. PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON INFORMATION AND COMMUNICATION TECHNOLOGIES, ICICT 2014, 2015, 46: 635-643.
  • [32] Das, Avishek; Sarma, Moumita Sen; Hoque, Mohammed Moshiul; Siddique, Nazmul; Dewan, M. Ali Akber. AVaTER: Fusing Audio, Visual, and Textual Modalities Using Cross-Modal Attention for Emotion Recognition. SENSORS, 2024, 24 (18).
  • [33] Cimtay, Yucel; Ekmekcioglu, Erhan; Caglar-Ozhan, Seyma. Cross-Subject Multimodal Emotion Recognition Based on Hybrid Fusion. IEEE ACCESS, 2020, 8: 168865-168878.
  • [34] Udahemuka, Gustave; Djouani, Karim; Kurien, Anish M. Multimodal Emotion Recognition Using Visual, Vocal and Physiological Signals: A Review. APPLIED SCIENCES-BASEL, 2024, 14 (17).
  • [35] Chen, Qiupu; Huang, Guimin. A novel dual attention-based BLSTM with hybrid features in speech emotion recognition. ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2021, 102.
  • [36] Wei, Wei; Jia, Qingxuan; Feng, Yongli. Emotion Recognition Based on Feedback Weighted Fusion of Multimodal Emotion Data. 2017 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND BIOMIMETICS (IEEE ROBIO 2017), 2017: 1682-1687.
  • [37] Al Roken, Noora; Barlas, Gerassimos. Multimodal Arabic emotion recognition using deep learning. SPEECH COMMUNICATION, 2023, 155.
  • [38] Ma, Fei; Li, Yang; Ni, Shiguang; Huang, Shao-Lun; Zhang, Lin. Data Augmentation for Audio-Visual Emotion Recognition with an Efficient Multimodal Conditional GAN. APPLIED SCIENCES-BASEL, 2022, 12 (01).
  • [39] Saini, Samiksha; Rao, Rohan; Vaichole, Vinit; Rane, Anand; Abin, Deepa. Emotion Recognition Using Multimodal Approach. 2018 FOURTH INTERNATIONAL CONFERENCE ON COMPUTING COMMUNICATION CONTROL AND AUTOMATION (ICCUBEA), 2018.
  • [40] Rehman, Abdul; Liu, Zhen-Tao; Li, Dan-Yun; Wu, Bao-Han. Cross-Corpus Speech Emotion Recognition Based on Hybrid Neural Networks. PROCEEDINGS OF THE 39TH CHINESE CONTROL CONFERENCE, 2020: 7464-7468.