Multi-modal speech emotion recognition using self-attention mechanism and multi-scale fusion framework

Cited by: 27
Authors
Liu, Yang [1 ]
Sun, Haoqin [1 ]
Guan, Wenbo [1 ]
Xia, Yuqi [1 ]
Zhao, Zhen [1 ]
Affiliations
[1] Qingdao Univ Sci & Technol, Sch Informat Sci & Technol, Qingdao 266061, Peoples R China
Keywords
Speech emotion recognition; Utterance-level contextual information; Multi-scale fusion framework; Neural networks
DOI
10.1016/j.specom.2022.02.006
Chinese Library Classification
O42 [Acoustics];
Discipline codes
070206 ; 082403 ;
Abstract
Accurately recognizing emotion from speech is a necessary yet challenging task due to the variability in speech and emotion. In this paper, a novel method combining a self-attention mechanism and a multi-scale fusion framework is proposed for multi-modal speech emotion recognition (SER) using speech and text information. A self-attentional bidirectional contextual LSTM (bc-LSTM) is proposed to learn context-sensitive dependencies from speech. Specifically, the BLSTM layer is applied to learn long-term dependencies and utterance-level contextual information, while the multi-head self-attention layer makes the model focus on the features most related to the emotions. A self-attentional multi-channel CNN (MCNN), which takes advantage of static and dynamic channels, is applied to learn general and thematic features from text. Finally, a multi-scale fusion strategy, comprising feature-level fusion and decision-level fusion, is applied to improve the overall performance. Experimental results on the benchmark dataset IEMOCAP demonstrate that our method gains absolute improvements of 1.48% and 3.00% over state-of-the-art strategies in terms of weighted accuracy (WA) and unweighted accuracy (UA), respectively.
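The multi-scale fusion strategy described in the abstract can be sketched roughly as follows: feature-level fusion concatenates the modality embeddings before a shared classifier, and decision-level fusion averages the per-modality class probabilities. This is a minimal illustrative sketch, not the paper's implementation; the mixing weight `alpha` and the toy four-class logits are assumptions introduced here for demonstration.

```python
import math

def softmax(logits):
    """Convert raw class scores to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def feature_level_fusion(speech_feat, text_feat):
    """Feature-level fusion: concatenate the speech and text
    embeddings before the shared classifier."""
    return speech_feat + text_feat  # list concatenation

def decision_level_fusion(speech_logits, text_logits, alpha=0.5):
    """Decision-level fusion: weighted average of the per-modality
    class probabilities (alpha is a hypothetical mixing weight)."""
    p_speech = softmax(speech_logits)
    p_text = softmax(text_logits)
    return [alpha * a + (1 - alpha) * b
            for a, b in zip(p_speech, p_text)]

# Toy four-class example (e.g. angry/happy/neutral/sad on IEMOCAP);
# the logit values are fabricated for illustration only.
speech_logits = [2.0, 0.5, 0.1, -1.0]
text_logits = [1.0, 1.5, 0.0, -0.5]
fused = decision_level_fusion(speech_logits, text_logits, alpha=0.6)
pred = max(range(len(fused)), key=fused.__getitem__)
```

Here both modalities agree most strongly on class 0, so the fused distribution keeps that prediction; in a real system `alpha` would be tuned on a validation set.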
Pages: 1-9
Page count: 9
Related papers
50 records in total
  • [31] MULTIMODAL CROSS- AND SELF-ATTENTION NETWORK FOR SPEECH EMOTION RECOGNITION
    Sun, Licai
    Liu, Bin
    Tao, Jianhua
    Lian, Zheng
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 4275 - 4279
  • [32] Audio-Visual Emotion Recognition System Using Multi-Modal Features
    Handa, Anand
    Agarwal, Rashi
    Kohli, Narendra
    INTERNATIONAL JOURNAL OF COGNITIVE INFORMATICS AND NATURAL INTELLIGENCE, 2021, 15 (04)
  • [33] A multi-modal deep learning system for Arabic emotion recognition
    Abu Shaqra F.
    Duwairi R.
    Al-Ayyoub M.
    International Journal of Speech Technology, 2023, 26 (01) : 123 - 139
  • [34] Speech emotion recognition based on multi-feature and multi-lingual fusion
    Chunyi Wang
    Ying Ren
    Na Zhang
    Fuwei Cui
    Shiying Luo
    Multimedia Tools and Applications, 2022, 81 : 4897 - 4907
  • [35] Multi-algorithm Fusion for Speech Emotion Recognition
    Verma, Gyanendra K.
    Tiwary, U. S.
    Agrawal, Shaishav
    ADVANCES IN COMPUTING AND COMMUNICATIONS, PT III, 2011, 192 : 452 - 459
  • [36] GM-TCNet: Gated Multi-scale Temporal Convolutional Network using Emotion Causality for Speech Emotion Recognition
    Ye, Jia-Xin
    Wen, Xin-Cheng
    Wang, Xuan-Ze
    Xu, Yong
    Luo, Yan
    Wu, Chang-Li
    Chen, Li-Yan
    Liu, Kun-Hong
    SPEECH COMMUNICATION, 2022, 145 : 21 - 35
  • [37] Speech Emotion Recognition Using Multi-granularity Feature Fusion Through Auditory Cognitive Mechanism
    Xu, Cong
    Li, Haifeng
    Bo, Hongjian
    Ma, Lin
    COGNITIVE COMPUTING - ICCC 2019, 2019, 11518 : 117 - 131
  • [38] SpeechEQ: Speech Emotion Recognition based on Multi-scale Unified Datasets and Multitask Learning
    Kang, Zuheng
    Peng, Junqing
    Wang, Jianzong
    Xiao, Jing
    INTERSPEECH 2022, 2022, : 4745 - 4749
  • [39] Multi-Stream Convolution-Recurrent Neural Networks Based on Attention Mechanism Fusion for Speech Emotion Recognition
    Tao, Huawei
    Geng, Lei
    Shan, Shuai
    Mai, Jingchao
    Fu, Hongliang
    ENTROPY, 2022, 24 (08)
  • [40] A Review on Speech Emotion Recognition Using Deep Learning and Attention Mechanism
    Lieskovska, Eva
    Jakubec, Maros
    Jarina, Roman
    Chmulik, Michal
    ELECTRONICS, 2021, 10 (10)