Multimodal emotion recognition based on audio and text by using hybrid attention networks

Cited by: 25
Authors
Zhang, Shiqing [1 ]
Yang, Yijiao [1 ,2 ]
Chen, Chen [1 ]
Liu, Ruixin [1 ,2 ]
Tao, Xin [1 ]
Guo, Wenping [1 ]
Xu, Yicheng [3 ]
Zhao, Xiaoming [1 ]
Affiliations
[1] Taizhou Univ, Inst Intelligent Informat Proc, Taizhou 318000, Zhejiang, Peoples R China
[2] Zhejiang Univ Sci & Technol, Sch Sci, Hangzhou 310023, Zhejiang, Peoples R China
[3] Taizhou Vocat & Tech Coll, Sch Informat Technol Engn, Taizhou 318000, Zhejiang, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Multimodal emotion recognition; Deep learning; Local intra-modal attention; Cross-modal attention; Global inter-modal attention; NEURAL-NETWORKS; SPEECH; FEATURES;
DOI
10.1016/j.bspc.2023.105052
CLC number
R318 [Biomedical Engineering];
Discipline code
0831;
Abstract
Multimodal Emotion Recognition (MER) has recently become a popular and challenging topic. The key challenge in MER is how to effectively fuse multimodal information. Most prior works do not fully exploit intra-modal and inter-modal attention mechanisms to jointly learn intra-modal and inter-modal emotionally salient information, which limits the performance of MER. To address this problem, this paper proposes a new MER framework based on audio and text using Hybrid Attention Networks (MER-HAN). The proposed MER-HAN combines three attention mechanisms, i.e., local intra-modal attention, cross-modal attention, and global inter-modal attention, to effectively learn intra-modal and inter-modal emotionally salient features for MER. Specifically, an Audio and Text Encoder (ATE) block, equipped with deep learning techniques and the local intra-modal attention mechanism, is first designed to learn high-level audio and text feature representations from the corresponding audio and text sequences. Then, a Cross-Modal Attention (CMA) block is presented to jointly capture high-level shared feature representations across the audio and text modalities. Finally, a Multimodal Emotion Classification (MEC) block with the global inter-modal attention mechanism produces the final MER results. Extensive experiments on two public multimodal emotion datasets, i.e., the IEMOCAP and MELD datasets, demonstrate the advantage of the proposed MER-HAN on MER tasks.
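To make the three attention stages described in the abstract concrete, the following is a minimal PyTorch-style sketch of such a pipeline: BiGRU encoders with local intra-modal attention (the ATE block), cross-modal attention between audio and text (the CMA block), and global inter-modal attention for fusion and classification (the MEC block). All module choices, dimensions, and names are illustrative assumptions and do not reproduce the authors' implementation.

import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    # Multi-head attention where one sequence queries another; with
    # query == context it reduces to intra-modal (self-)attention.
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_seq, context_seq):
        attended, _ = self.attn(query_seq, context_seq, context_seq)
        return self.norm(query_seq + attended)  # residual + layer norm

class MERHANSketch(nn.Module):
    # Hypothetical sketch of the MER-HAN pipeline, not the authors' code.
    def __init__(self, audio_dim=128, text_dim=300, hidden=128, num_classes=4):
        super().__init__()
        dim = 2 * hidden  # BiGRU output dimension
        # ATE block: BiGRU encoders plus local intra-modal attention.
        self.audio_enc = nn.GRU(audio_dim, hidden, batch_first=True, bidirectional=True)
        self.text_enc = nn.GRU(text_dim, hidden, batch_first=True, bidirectional=True)
        self.audio_local = CrossModalAttention(dim)
        self.text_local = CrossModalAttention(dim)
        # CMA block: each modality attends to the other one.
        self.audio_to_text = CrossModalAttention(dim)
        self.text_to_audio = CrossModalAttention(dim)
        # MEC block: global inter-modal attention weighs the two pooled
        # modality vectors, followed by the emotion classifier.
        self.global_attn = nn.Linear(dim, 1)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, audio, text):
        a, _ = self.audio_enc(audio)                  # (B, Ta, dim)
        t, _ = self.text_enc(text)                    # (B, Tt, dim)
        a = self.audio_local(a, a)                    # local intra-modal attention
        t = self.text_local(t, t)
        a_ctx = self.audio_to_text(a, t)              # audio queries, text context
        t_ctx = self.text_to_audio(t, a)              # text queries, audio context
        a_vec, t_vec = a_ctx.mean(dim=1), t_ctx.mean(dim=1)
        stacked = torch.stack([a_vec, t_vec], dim=1)  # (B, 2, dim)
        weights = torch.softmax(self.global_attn(stacked), dim=1)
        fused = (weights * stacked).sum(dim=1)        # global inter-modal fusion
        return self.classifier(fused)

For example, MERHANSketch()(torch.randn(8, 200, 128), torch.randn(8, 40, 300)) returns logits of shape (8, 4); in practice the audio input would be frame-level acoustic features and the text input word embeddings.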
Pages: 10
Related papers
50 records in total
  • [1] Multimodal Speech Emotion Recognition using Cross Attention with Aligned Audio and Text
    Lee, Yoonhyung
    Yoon, Seunghyun
    Jung, Kyomin
    INTERSPEECH 2020, 2020, : 2717 - 2721
  • [2] MULTIMODAL SPEECH EMOTION RECOGNITION USING AUDIO AND TEXT
    Yoon, Seunghyun
    Byun, Seokhyun
    Jung, Kyomin
    2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018, : 112 - 118
  • [3] Multimodal Emotion Recognition Using Transfer Learning on Audio and Text Data
    Deng, James J.
    Leung, Clement H. C.
    Li, Yuanxi
    COMPUTATIONAL SCIENCE AND ITS APPLICATIONS, ICCSA 2021, PT III, 2021, 12951 : 552 - 563
  • [4] A multimodal hierarchical approach to speech emotion recognition from audio and text
    Singh, Prabhav
    Srivastava, Ridam
    Rana, K. P. S.
    Kumar, Vineet
    KNOWLEDGE-BASED SYSTEMS, 2021, 229
  • [5] Deep learning-based multimodal emotion recognition from audio, visual, and text modalities: A systematic review of recent advancements and future prospects
    Zhang, Shiqing
    Yang, Yijiao
    Chen, Chen
    Zhang, Xingnan
    Leng, Qingming
    Zhao, Xiaoming
    EXPERT SYSTEMS WITH APPLICATIONS, 2024, 237
  • [6] Audio-Visual Attention Networks for Emotion Recognition
    Lee, Jiyoung
    Kim, Sunok
    Kim, Seungryong
    Sohn, Kwanghoon
    AVSU'18: PROCEEDINGS OF THE 2018 WORKSHOP ON AUDIO-VISUAL SCENE UNDERSTANDING FOR IMMERSIVE MULTIMEDIA, 2018, : 27 - 32
  • [7] Audio and Video Bimodal Emotion Recognition in Social Networks Based on Improved AlexNet Network and Attention Mechanism
    Liu, Min
    Tang, Jun
    JOURNAL OF INFORMATION PROCESSING SYSTEMS, 2021, 17 (04): : 754 - 771
  • [8] A Survey of Deep Learning-Based Multimodal Emotion Recognition: Speech, Text, and Face
    Lian, Hailun
    Lu, Cheng
    Li, Sunan
    Zhao, Yan
    Tang, Chuangao
    Zong, Yuan
    ENTROPY, 2023, 25 (10)
  • [9] Group Gated Fusion on Attention-based Bidirectional Alignment for Multimodal Emotion Recognition
    Liu, Pengfei
    Li, Kun
    Meng, Helen
    INTERSPEECH 2020, 2020, : 379 - 383
  • [10] End-to-End Multimodal Emotion Recognition Using Deep Neural Networks
    Tzirakis, Panagiotis
    Trigeorgis, George
    Nicolaou, Mihalis A.
    Schuller, Bjorn W.
    Zafeiriou, Stefanos
    IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2017, 11 (08) : 1301 - 1309