Multimodal emotion recognition based on audio and text by using hybrid attention networks

Cited by: 25
Authors
Zhang, Shiqing [1 ]
Yang, Yijiao [1 ,2 ]
Chen, Chen [1 ]
Liu, Ruixin [1 ,2 ]
Tao, Xin [1 ]
Guo, Wenping [1 ]
Xu, Yicheng [3 ]
Zhao, Xiaoming [1 ]
Affiliations
[1] Taizhou Univ, Inst Intelligent Informat Proc, Taizhou 318000, Zhejiang, Peoples R China
[2] Zhejiang Univ Sci & Technol, Sch Sci, Hangzhou 310023, Zhejiang, Peoples R China
[3] Taizhou Vocat & Tech Coll, Sch Informat Technol Engn, Taizhou 318000, Zhejiang, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Multimodal emotion recognition; Deep learning; Local intra-modal attention; Cross-modal attention; Global inter-modal attention; Neural networks; Speech; Features
DOI
10.1016/j.bspc.2023.105052
Chinese Library Classification
R318 [Biomedical Engineering];
Subject Classification Code
0831;
Abstract
Multimodal Emotion Recognition (MER) has recently become a popular yet challenging topic. The central challenge in MER is how to fuse multimodal information effectively. Most prior works do not fully exploit intra-modal and inter-modal attention mechanisms to jointly learn emotionally salient information within and across modalities, which limits MER performance. To address this problem, this paper proposes a new MER framework based on audio and text using Hybrid Attention Networks (MER-HAN). MER-HAN combines three attention mechanisms, namely local intra-modal attention, cross-modal attention, and global inter-modal attention, to learn emotionally salient features both within and across modalities. Specifically, an Audio and Text Encoder (ATE) block, equipped with deep learning techniques and a local intra-modal attention mechanism, first learns high-level audio and text feature representations from the corresponding audio and text sequences. A Cross-Modal Attention (CMA) block then jointly captures high-level shared feature representations across the audio and text modalities. Finally, a Multimodal Emotion Classification (MEC) block with a global inter-modal attention mechanism produces the final MER results. Extensive experiments on two public multimodal emotion datasets, IEMOCAP and MELD, demonstrate the advantage of the proposed MER-HAN on MER tasks.
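To make the three-stage pipeline from the abstract concrete, the following is a minimal PyTorch sketch, not the authors' implementation. The use of nn.MultiheadAttention for each stage, the layer sizes, the mean-pooling of sequences, and the softmax-based inter-modal weighting in the MEC block are all assumptions made for illustration.

# A minimal sketch (assumed design, not the authors' code) of the
# ATE -> CMA -> MEC attention pipeline described in the abstract.
import torch
import torch.nn as nn

class MERHANSketch(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_classes=4):
        super().__init__()
        # ATE block: local intra-modal (self-)attention over each
        # modality's own sequence of high-level features.
        self.audio_self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.text_self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # CMA block: cross-modal attention in both directions
        # (audio queries attend to text, and vice versa).
        self.a2t_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.t2a_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # MEC block: global inter-modal attention that weights the two
        # modality summaries before classification (assumed form).
        self.modality_score = nn.Linear(d_model, 1)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, audio_feats, text_feats):
        # audio_feats: (B, Ta, d_model); text_feats: (B, Tt, d_model)
        a, _ = self.audio_self_attn(audio_feats, audio_feats, audio_feats)
        t, _ = self.text_self_attn(text_feats, text_feats, text_feats)
        # Cross-modal attention: each modality queries the other.
        a_ct, _ = self.a2t_attn(a, t, t)  # audio enriched by text
        t_ca, _ = self.t2a_attn(t, a, a)  # text enriched by audio
        # Mean-pool each sequence into one summary vector per modality.
        summaries = torch.stack([a_ct.mean(dim=1), t_ca.mean(dim=1)], dim=1)  # (B, 2, d)
        # Global inter-modal attention: softmax weights over modalities.
        weights = torch.softmax(self.modality_score(summaries), dim=1)  # (B, 2, 1)
        fused = (weights * summaries).sum(dim=1)  # (B, d)
        return self.classifier(fused)

# Usage with dummy inputs (four emotion classes, IEMOCAP-style setup):
model = MERHANSketch()
logits = model(torch.randn(2, 100, 256), torch.randn(2, 30, 256))
print(logits.shape)  # torch.Size([2, 4])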
Pages: 10
Related Papers
50 records in total
  • [41] Multimodal text-emoji fusion using deep neural networks for text-based emotion detection in online communication
    Kusal, Sheetal
    Patil, Shruti
    Kotecha, Ketan
    JOURNAL OF BIG DATA, 2025, 12 (01)
  • [42] Enhancing Cross-Language Multimodal Emotion Recognition With Dual Attention Transformers
    Zaidi, Syed Aun Muhammad
    Latif, Siddique
    Qadir, Junaid
    IEEE OPEN JOURNAL OF THE COMPUTER SOCIETY, 2024, 5 : 684 - 693
  • [43] Speech Emotion Recognition Using Convolutional Neural Networks with Attention Mechanism
    Mountzouris, Konstantinos
    Perikos, Isidoros
    Hatzilygeroudis, Ioannis
    Corchado, Juan M.
    Iglesias, Carlos A.
    Kim, Byung-Gyu
    Mehmood, Rashid
    Ren, Fuji
    Lee, In
    ELECTRONICS, 2023, 12 (20)
  • [44] MSER: Multimodal speech emotion recognition using cross-attention with deep fusion
    Khan, Mustaqeem
    Gueaieb, Wail
    El Saddik, Abdulmotaleb
    Kwon, Soonil
    EXPERT SYSTEMS WITH APPLICATIONS, 2024, 245
  • [45] A Review on Speech Emotion Recognition Using Deep Learning and Attention Mechanism
    Lieskovska, Eva
    Jakubec, Maros
    Jarina, Roman
    Chmulik, Michal
    ELECTRONICS, 2021, 10 (10)
  • [46] Multimodal Emotion Recognition Using Deep Generalized Canonical Correlation Analysis with an Attention Mechanism
    Lan, Yu-Ting
    Liu, Wei
    Lu, Bao-Liang
    2020 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2020,
  • [47] EEG-based multimodal emotion recognition with optimal trained hybrid classifier
    Chakravarthy, G. Kalyana
    Suchithra, M.
    Thatavarti, Satish
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 83 (17) : 50133 - 50156
  • [48] MULTIMODAL TRANSFORMER WITH LEARNABLE FRONTEND AND SELF ATTENTION FOR EMOTION RECOGNITION
    Dutta, Soumya
    Ganapathy, Sriram
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6917 - 6921
  • [49] Speech Emotion Recognition Using Audio Matching
    Chaturvedi, Iti
    Noel, Tim
    Satapathy, Ranjan
    ELECTRONICS, 2022, 11 (23)
  • [50] Emotion Recognition of College Students Based on Audio and Video Image
    Zhu, Chenjie
    Ding, Ting
    Min, Xue
    TRAITEMENT DU SIGNAL, 2022, 39 (05) : 1475 - 1481