Multi-modal speech emotion recognition using self-attention mechanism and multi-scale fusion framework

Cited by: 27
Authors
Liu, Yang [1 ]
Sun, Haoqin [1 ]
Guan, Wenbo [1 ]
Xia, Yuqi [1 ]
Zhao, Zhen [1 ]
Affiliations
[1] Qingdao Univ Sci & Technol, Sch Informat Sci & Technol, Qingdao 266061, Peoples R China
Keywords
Speech emotion recognition; Utterance-level contextual information; Multi-scale fusion framework; Neural networks
DOI
10.1016/j.specom.2022.02.006
Chinese Library Classification
O42 [Acoustics];
Discipline codes
070206 ; 082403 ;
Abstract
Accurately recognizing emotion from speech is a necessary yet challenging task due to the variability in speech and emotion. In this paper, a novel method combining a self-attention mechanism and a multi-scale fusion framework is proposed for multi-modal speech emotion recognition (SER) using speech and text information. A self-attentional bidirectional contextual LSTM (bc-LSTM) is proposed to learn context-sensitive dependencies from speech. Specifically, the BLSTM layer is applied to learn long-term dependencies and utterance-level contextual information, while the multi-head self-attention layer makes the model focus on the features most related to the emotions. A self-attentional multi-channel CNN (MCNN), which takes advantage of static and dynamic channels, is applied to learn general and thematic features from text. Finally, a multi-scale fusion strategy, comprising feature-level fusion and decision-level fusion, is applied to improve the overall performance. Experimental results on the benchmark dataset IEMOCAP demonstrate that our method gains absolute improvements of 1.48% and 3.00% over state-of-the-art strategies in terms of weighted accuracy (WA) and unweighted accuracy (UA), respectively.
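The multi-scale fusion strategy described in the abstract can be sketched roughly as follows: feature-level fusion concatenates the modality embeddings before a shared classifier, and decision-level fusion averages the per-modality class probabilities. This is a minimal illustrative sketch, not the paper's implementation; the mixing weight `alpha` and the toy four-class logits are assumptions introduced here for demonstration.

```python
import math

def softmax(logits):
    """Convert raw class scores to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def feature_level_fusion(speech_feat, text_feat):
    """Feature-level fusion: concatenate the speech and text
    embeddings before the shared classifier."""
    return speech_feat + text_feat  # list concatenation

def decision_level_fusion(speech_logits, text_logits, alpha=0.5):
    """Decision-level fusion: weighted average of the per-modality
    class probabilities (alpha is a hypothetical mixing weight)."""
    p_speech = softmax(speech_logits)
    p_text = softmax(text_logits)
    return [alpha * a + (1 - alpha) * b
            for a, b in zip(p_speech, p_text)]

# Toy four-class example (e.g. angry/happy/neutral/sad on IEMOCAP);
# the logit values are fabricated for illustration only.
speech_logits = [2.0, 0.5, 0.1, -1.0]
text_logits = [1.0, 1.5, 0.0, -0.5]
fused = decision_level_fusion(speech_logits, text_logits, alpha=0.6)
pred = max(range(len(fused)), key=fused.__getitem__)
```

Here both modalities agree most strongly on class 0, so the fused distribution keeps that prediction; in a real system `alpha` would be tuned on a validation set.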
Pages: 1-9
Page count: 9
Related papers
50 records in total
  • [31] MULTIMODAL CROSS- AND SELF-ATTENTION NETWORK FOR SPEECH EMOTION RECOGNITION
    Sun, Licai
    Liu, Bin
    Tao, Jianhua
    Lian, Zheng
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 4275 - 4279
  • [32] Audio-Visual Emotion Recognition System Using Multi-Modal Features
    Handa, Anand
    Agarwal, Rashi
    Kohli, Narendra
    INTERNATIONAL JOURNAL OF COGNITIVE INFORMATICS AND NATURAL INTELLIGENCE, 2021, 15 (04)
  • [33] A multi-modal deep learning system for Arabic emotion recognition
    Abu Shaqra F.
    Duwairi R.
    Al-Ayyoub M.
    International Journal of Speech Technology, 2023, 26 (01) : 123 - 139
  • [34] Speech emotion recognition based on multi-feature and multi-lingual fusion
    Chunyi Wang
    Ying Ren
    Na Zhang
    Fuwei Cui
    Shiying Luo
    Multimedia Tools and Applications, 2022, 81 : 4897 - 4907
  • [35] Multi-algorithm Fusion for Speech Emotion Recognition
    Verma, Gyanendra K.
    Tiwary, U. S.
    Agrawal, Shaishav
    ADVANCES IN COMPUTING AND COMMUNICATIONS, PT III, 2011, 192 : 452 - 459
  • [36] GM-TCNet: Gated Multi-scale Temporal Convolutional Network using Emotion Causality for Speech Emotion Recognition
    Ye, Jia-Xin
    Wen, Xin-Cheng
    Wang, Xuan-Ze
    Xu, Yong
    Luo, Yan
    Wu, Chang-Li
    Chen, Li-Yan
    Liu, Kun-Hong
    SPEECH COMMUNICATION, 2022, 145 : 21 - 35
  • [37] Speech Emotion Recognition Using Multi-granularity Feature Fusion Through Auditory Cognitive Mechanism
    Xu, Cong
    Li, Haifeng
    Bo, Hongjian
    Ma, Lin
    COGNITIVE COMPUTING - ICCC 2019, 2019, 11518 : 117 - 131
  • [38] SpeechEQ: Speech Emotion Recognition based on Multi-scale Unified Datasets and Multitask Learning
    Kang, Zuheng
    Peng, Junqing
    Wang, Jianzong
    Xiao, Jing
    INTERSPEECH 2022, 2022, : 4745 - 4749
  • [39] Multi-Stream Convolution-Recurrent Neural Networks Based on Attention Mechanism Fusion for Speech Emotion Recognition
    Tao, Huawei
    Geng, Lei
    Shan, Shuai
    Mai, Jingchao
    Fu, Hongliang
    ENTROPY, 2022, 24 (08)
  • [40] A Review on Speech Emotion Recognition Using Deep Learning and Attention Mechanism
    Lieskovska, Eva
    Jakubec, Maros
    Jarina, Roman
    Chmulik, Michal
    ELECTRONICS, 2021, 10 (10)