Multi-modal speech emotion recognition using self-attention mechanism and multi-scale fusion framework

Cited by: 27
Authors
Liu, Yang [1 ]
Sun, Haoqin [1 ]
Guan, Wenbo [1 ]
Xia, Yuqi [1 ]
Zhao, Zhen [1 ]
Affiliations
[1] Qingdao Univ Sci & Technol, Sch Informat Sci & Technol, Qingdao 266061, Peoples R China
Keywords
Speech emotion recognition; Utterance-level contextual information; Multi-scale fusion framework; Neural networks
DOI
10.1016/j.specom.2022.02.006
CLC number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
Accurately recognizing emotion from speech is a necessary yet challenging task due to the variability of speech and emotion. In this paper, a novel method combining a self-attention mechanism with a multi-scale fusion framework is proposed for multi-modal speech emotion recognition (SER) using speech and text information. A self-attentional bidirectional contextual LSTM (bc-LSTM) is proposed to learn context-sensitive dependencies from speech. Specifically, the BLSTM layer learns long-term dependencies and utterance-level contextual information, while the multi-head self-attention layer makes the model focus on the features most relevant to the emotions. A self-attentional multi-channel CNN (MCNN), which takes advantage of static and dynamic channels, is applied to learn general and thematic features from text. Finally, a multi-scale fusion strategy, comprising feature-level fusion and decision-level fusion, is applied to improve overall performance. Experimental results on the benchmark dataset IEMOCAP demonstrate that our method achieves absolute improvements of 1.48% and 3.00% over state-of-the-art strategies in weighted accuracy (WA) and unweighted accuracy (UA), respectively.
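The abstract describes the architecture only at a high level. Below is a minimal PyTorch sketch of the described pipeline, offered as an illustration rather than the authors' implementation: the module names (SelfAttentionBLSTM, MultiChannelTextCNN, MultiScaleFusion), the layer sizes, the 40-dimensional acoustic features, the four emotion classes, and the uniform averaging used for decision-level fusion are all assumptions not taken from the paper.

```python
# Illustrative sketch of the described architecture; all hyperparameters,
# module names, and the decision-fusion weighting are assumptions.
import torch
import torch.nn as nn


class SelfAttentionBLSTM(nn.Module):
    """Speech branch: BLSTM for utterance-level context, then multi-head self-attention."""

    def __init__(self, feat_dim=40, hidden=128, heads=4, n_classes=4):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                      # x: (batch, frames, feat_dim)
        h, _ = self.blstm(x)                   # long-term bidirectional context
        a, _ = self.attn(h, h, h)              # focus on emotion-salient frames
        emb = a.mean(dim=1)                    # pooled utterance embedding
        return emb, self.fc(emb)


class MultiChannelTextCNN(nn.Module):
    """Text branch: CNN over a static (frozen) and a dynamic (trainable) embedding channel."""

    def __init__(self, vocab=10000, emb_dim=300, n_filters=100, n_classes=4):
        super().__init__()
        self.static = nn.Embedding(vocab, emb_dim)   # frozen: general features
        self.static.weight.requires_grad = False
        self.dynamic = nn.Embedding(vocab, emb_dim)  # fine-tuned: thematic features
        self.convs = nn.ModuleList(
            nn.Conv1d(2 * emb_dim, n_filters, k) for k in (3, 4, 5))
        self.fc = nn.Linear(3 * n_filters, n_classes)

    def forward(self, tokens):                 # tokens: (batch, words)
        e = torch.cat([self.static(tokens), self.dynamic(tokens)], dim=-1)
        e = e.transpose(1, 2)                  # (batch, channels, words) for Conv1d
        pooled = [c(e).relu().max(dim=-1).values for c in self.convs]
        emb = torch.cat(pooled, dim=-1)
        return emb, self.fc(emb)


class MultiScaleFusion(nn.Module):
    """Feature-level fusion of the two embeddings plus decision-level fusion of logits."""

    def __init__(self, speech_dim=256, text_dim=300, n_classes=4):
        super().__init__()
        self.speech = SelfAttentionBLSTM(n_classes=n_classes)
        self.text = MultiChannelTextCNN(n_classes=n_classes)
        self.fusion_fc = nn.Linear(speech_dim + text_dim, n_classes)

    def forward(self, speech_x, tokens):
        s_emb, s_logits = self.speech(speech_x)
        t_emb, t_logits = self.text(tokens)
        # feature-level fusion: classify the concatenated branch embeddings
        f_logits = self.fusion_fc(torch.cat([s_emb, t_emb], dim=-1))
        # decision-level fusion: average the three classifiers' probabilities
        return (s_logits.softmax(-1) + t_logits.softmax(-1)
                + f_logits.softmax(-1)) / 3


# Usage with random stand-in data: two utterances of 200 acoustic frames
# and two 30-token transcripts yield one probability row per utterance.
model = MultiScaleFusion()
speech = torch.randn(2, 200, 40)
tokens = torch.randint(0, 10000, (2, 30))
print(model(speech, tokens).shape)  # torch.Size([2, 4])
```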
Pages: 1-9
Number of pages: 9
Related papers
50 in total
  • [41] Speech Emotion Recognition Using Multi-Scale Global-Local Representation Learning with Feature Pyramid Network
    Wang, Yuhua
    Huang, Jianxing
    Zhao, Zhengdao
    Lan, Haiyan
    Zhang, Xinjia
    APPLIED SCIENCES-BASEL, 2024, 14 (24):
  • [42] MBDA: A Multi-scale Bidirectional Perception Approach for Cross-Corpus Speech Emotion Recognition
    Li, Jiayang
    Wang, Xiaoye
    Li, Siyuan
    Shi, Jia
    Xiao, Yingyuan
    ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT III, ICIC 2024, 2024, 14877 : 329 - 341
  • [43] Speech Emotion Recognition Based on Self-Attention Weight Correction for Acoustic and Text Features
    Santoso, Jennifer
    Yamada, Takeshi
    Ishizuka, Kenkichi
    Hashimoto, Taiichi
    Makino, Shoji
    IEEE ACCESS, 2022, 10 : 115732 - 115743
  • [44] Enhancing speech emotion recognition: a deep learning approach with self-attention and acoustic features
    Aghajani, Khadijeh
    Zohrevandi, Mahbanou
    JOURNAL OF SUPERCOMPUTING, 2025, 81 (05)
  • [45] Recurrent multi-head attention fusion network for combining audio and text for speech emotion recognition
    Ahn, Chung-Soo
    Kasun, L. L. Chamara
    Sivadas, Sunil
    Rajapakse, Jagath C.
    INTERSPEECH 2022, 2022, : 744 - 748
  • [46] Speech emotion recognition based on multi-feature and multi-lingual fusion
    Wang, Chunyi
    Ren, Ying
    Zhang, Na
    Cui, Fuwei
    Luo, Shiying
    MULTIMEDIA TOOLS AND APPLICATIONS, 2022, 81 (04) : 4897 - 4907
  • [47] Semantic Enhancement Network Integrating Label Knowledge for Multi-modal Emotion Recognition
    Zheng, HongFeng
    Miao, ShengFa
    Yu, Qian
    Mu, YongKang
    Jin, Xin
    Yan, KeShan
    ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT V, ICIC 2024, 2024, 14879 : 473 - 484
  • [48] Speech Emotion Recognition via Multi-Level Attention Network
    Liu, Ke
    Wang, Dekui
    Wu, Dongya
    Liu, Yutao
    Feng, Jun
    IEEE SIGNAL PROCESSING LETTERS, 2022, 29 : 2278 - 2282
  • [49] EmoNet: A Transfer Learning Framework for Multi-Corpus Speech Emotion Recognition
    Gerczuk, Maurice
    Amiriparian, Shahin
    Ottl, Sandra
    Schuller, Bjorn W.
    IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, 2023, 14 (02) : 1472 - 1487
  • [50] Multimodal Approach of Speech Emotion Recognition Using Multi-Level Multi-Head Fusion Attention-Based Recurrent Neural Network
    Ngoc-Huynh Ho
    Yang, Hyung-Jeong
    Kim, Soo-Hyung
    Lee, Gueesang
    IEEE ACCESS, 2020, 8 : 61672 - 61686