Multi-modal speech emotion recognition using self-attention mechanism and multi-scale fusion framework

Citations: 26
Authors
Liu, Yang [1 ]
Sun, Haoqin [1 ]
Guan, Wenbo [1 ]
Xia, Yuqi [1 ]
Zhao, Zhen [1 ]
Affiliations
[1] Qingdao Univ Sci & Technol, Sch Informat Sci & Technol, Qingdao 266061, Peoples R China
Keywords
Speech emotion recognition; Utterance-level contextual information; Multi-scale fusion framework; Neural networks
DOI
10.1016/j.specom.2022.02.006
Chinese Library Classification
O42 [Acoustics]
Discipline codes
070206; 082403
Abstract
Accurately recognizing emotion from speech is a necessary yet challenging task due to the variability of speech and emotion. In this paper, a novel method combining a self-attention mechanism and a multi-scale fusion framework is proposed for multi-modal speech emotion recognition (SER) using speech and text information. A self-attentional bidirectional contextual LSTM (bc-LSTM) is proposed to learn context-sensitive dependencies from speech. Specifically, the BLSTM layer learns long-term dependencies and utterance-level contextual information, while the multi-head self-attention layer makes the model focus on the features most relevant to the emotions. A self-attentional multi-channel CNN (MCNN), which takes advantage of static and dynamic channels, is applied to learn general and thematic features from text. Finally, a multi-scale fusion strategy, including feature-level fusion and decision-level fusion, is applied to improve overall performance. Experimental results on the benchmark dataset IEMOCAP demonstrate that our method gains absolute improvements of 1.48% and 3.00% over state-of-the-art strategies in terms of weighted accuracy (WA) and unweighted accuracy (UA), respectively.
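The decision-level part of the fusion strategy described in the abstract can be sketched as a weighted combination of the per-class posteriors produced by the two modality branches. This is a minimal illustrative sketch only: the fusion weight `alpha`, the example posteriors, and the helper names `decision_level_fusion` and `predict` are assumptions, not taken from the paper.

```python
# Hedged sketch of decision-level fusion for two modality branches
# (speech bc-LSTM branch and text MCNN branch). All numeric values
# and the weight alpha below are illustrative assumptions.

def decision_level_fusion(speech_probs, text_probs, alpha=0.5):
    """Weighted average of per-class posteriors from the two branches."""
    assert len(speech_probs) == len(text_probs)
    return [alpha * s + (1.0 - alpha) * t
            for s, t in zip(speech_probs, text_probs)]

def predict(fused_probs, labels):
    """Return the label whose fused score is highest."""
    best = max(range(len(fused_probs)), key=fused_probs.__getitem__)
    return labels[best]

if __name__ == "__main__":
    labels = ["angry", "happy", "neutral", "sad"]  # common IEMOCAP 4-class setup
    speech = [0.10, 0.55, 0.25, 0.10]  # posterior from the speech branch (illustrative)
    text = [0.05, 0.30, 0.60, 0.05]    # posterior from the text branch (illustrative)
    fused = decision_level_fusion(speech, text, alpha=0.6)
    print(predict(fused, labels))
```

Because each input is a probability distribution, the weighted average is also a valid distribution, so the fused scores can be compared directly across classes.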
Pages: 1-9 (9 pages)
Related papers (50 total)
  • [31] Self-attention for Speech Emotion Recognition
    Tarantino, Lorenzo
    Garner, Philip N.
    Lazaridis, Alexandros
    INTERSPEECH 2019, 2019, : 2578 - 2582
  • [32] Multi-modal emotion recognition using tensor decomposition fusion and self-supervised multi-tasking
    Wang, Rui
    Zhu, Jiawei
    Wang, Shoujin
    Wang, Tao
    Huang, Jingze
    Zhu, Xianxun
    INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL, 2024, 13 (04)
  • [33] Multi-Scale Self-Attention for Text Classification
    Guo, Qipeng
    Qiu, Xipeng
    Liu, Pengfei
    Xue, Xiangyang
    Zhang, Zheng
    THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 7847 - 7854
  • [34] EEG Emotion Recognition Method Using Multi-Scale and Multi-Path Hybrid Attention Mechanism
    Gu, Xuejing
    Liu, Jia
    Guo, Yucheng
    Yang, Zhaohui
COMPUTER ENGINEERING AND APPLICATIONS, 2024, 60 (19) : 130 - 138
  • [35] Multi-Modal Fusion Emotion Recognition Method of Speech Expression Based on Deep Learning
    Liu, Dong
    Wang, Zhiyong
    Wang, Lifeng
    Chen, Longxi
    FRONTIERS IN NEUROROBOTICS, 2021, 15
  • [36] Uniting Multi-Scale Local Feature Awareness and the Self-Attention Mechanism for Named Entity Recognition
    Shi, Lin
    Zou, Xianming
    Dai, Chenxu
    Ji, Zhanlin
    MATHEMATICS, 2023, 11 (11)
  • [37] Multi-modal Conditional Attention Fusion for Dimensional Emotion Prediction
    Chen, Shizhe
    Jin, Qin
    MM'16: PROCEEDINGS OF THE 2016 ACM MULTIMEDIA CONFERENCE, 2016, : 571 - 575
  • [38] Dense Attention Memory Network for Multi-modal emotion recognition
    Ma, Gailing
    Guo, Xiao
    2022 5TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND NATURAL LANGUAGE PROCESSING, MLNLP 2022, 2022, : 48 - 53
  • [39] Multi-Modal Emotion Recognition Using Speech Features and Text-Embedding
    Byun, Sung-Woo
    Kim, Ju-Hee
    Lee, Seok-Pil
APPLIED SCIENCES-BASEL, 2021, 11 (17)
  • [40] Multi-Stride Self-Attention for Speech Recognition
    Han, Kyu J.
    Huang, Jing
    Tang, Yun
    He, Xiaodong
    Zhou, Bowen
    INTERSPEECH 2019, 2019, : 2788 - 2792