Multi-modal speech emotion recognition using self-attention mechanism and multi-scale fusion framework

Cited by: 27

Authors:

Liu, Yang [1]

Sun, Haoqin [1]

Guan, Wenbo [1]

Xia, Yuqi [1]

Zhao, Zhen [1]

Affiliations:

[1] Qingdao Univ Sci & Technol, Sch Informat Sci & Technol, Qingdao 266061, Peoples R China

Source:

SPEECH COMMUNICATION | 2022 / Vol. 139

Keywords:

Speech emotion recognition; Utterance-level contextual information; Multi-scale fusion framework; NEURAL-NETWORKS;

DOI:

10.1016/j.specom.2022.02.006

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

Accurately recognizing emotion from speech is a necessary yet challenging task because of the variability in speech and emotion. In this paper, a novel method that combines a self-attention mechanism with a multi-scale fusion framework is proposed for multi-modal speech emotion recognition (SER) using speech and text information. A self-attentional bidirectional contextual LSTM (bc-LSTM) is proposed to learn context-sensitive dependencies from speech. Specifically, the BLSTM layer learns long-term dependencies and utterance-level contextual information, and the multi-head self-attention layer makes the model focus on the features most related to the emotions. A self-attentional multi-channel CNN (MCNN), which takes advantage of static and dynamic channels, is applied to learn general and thematic features from text. Finally, a multi-scale fusion strategy, including feature-level fusion and decision-level fusion, is applied to improve the overall performance. Experimental results on the benchmark dataset IEMOCAP demonstrate that our method achieves absolute improvements of 1.48% and 3.00% over state-of-the-art strategies in terms of weighted accuracy (WA) and unweighted accuracy (UA), respectively.
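
The abstract distinguishes feature-level fusion from decision-level fusion but does not give details. As a rough illustration only, the sketch below assumes that feature-level fusion concatenates the speech and text feature vectors before classification, and that decision-level fusion takes a weighted average of the per-modality class probabilities. The function names, the fusion weights, and the four emotion labels are all hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def feature_level_fusion(speech_feat, text_feat):
    """Assumed form of feature-level fusion: plain concatenation."""
    return list(speech_feat) + list(text_feat)


def decision_level_fusion(prob_list, weights=None):
    """Assumed form of decision-level fusion: weighted mean of class
    probabilities produced by each modality-specific classifier."""
    n = len(prob_list)
    if weights is None:
        weights = [1.0 / n] * n  # equal weights by default
    total_w = sum(weights)
    num_classes = len(prob_list[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, prob_list)) / total_w
        for c in range(num_classes)
    ]


# Hypothetical label set and hypothetical classifier scores.
LABELS = ["angry", "happy", "neutral", "sad"]
speech_probs = softmax([2.0, 0.5, 0.1, 0.1])
text_probs = softmax([0.2, 1.5, 0.1, 0.1])

fused_features = feature_level_fusion([0.3, 0.7], [0.1, 0.9, 0.4])
fused_probs = decision_level_fusion([speech_probs, text_probs], weights=[0.6, 0.4])
predicted = LABELS[fused_probs.index(max(fused_probs))]
```

In a real system the fused feature vector would feed a further classifier, and the fusion weights would be tuned on validation data; both steps are omitted here.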


50 items in total

[21] Combining Gated Convolutional Networks and Self-Attention Mechanism for Speech Emotion Recognition
Li, Chao
Jiao, Jinlong
Zhao, Yiqin
Zhao, Ziping
2019 8TH INTERNATIONAL CONFERENCE ON AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION WORKSHOPS AND DEMOS (ACIIW), 2019, : 105 - 109
[22] Multi-Modal Emotion Aware System Based on Fusion of Speech and Brain Information
Ghoniem, Rania M.
Algarni, Abeer D.
Shaalan, Khaled
INFORMATION, 2019, 10 (07)
[23] EEG emotion recognition approach using multi-scale convolution and feature fusion
Zhang, Yong
Shan, Qingguo
Chen, Wenyun
Liu, Wenzhe
VISUAL COMPUTER, 2025, 41 (06) : 4157 - 4169
[24] TLBT-Net: A Multi-scale Cross-fusion Model for Speech Emotion Recognition
Yu, Anli
Sun, Xuelian
Wu, Xiaoyang
PROCEEDINGS OF INTERNATIONAL CONFERENCE ON MODELING, NATURAL LANGUAGE PROCESSING AND MACHINE LEARNING, CMNM 2024, 2024, : 245 - 250
[25] Speech emotion recognition using recurrent neural networks with directional self-attention
Li, Dongdong
Liu, Jinlin
Yang, Zhuo
Sun, Linyu
Wang, Zhe
EXPERT SYSTEMS WITH APPLICATIONS, 2021, 173
[26] BAT: Block and token self-attention for speech emotion recognition
Lei, Jianjun
Zhu, Xiangwei
Wang, Ying
NEURAL NETWORKS, 2022, 156 : 67 - 80
[27] Multi-view Self-supervised Learning and Multi-scale Feature Fusion for Automatic Speech Recognition
Zhao, Jingyu
Li, Ruwei
Tian, Maocun
An, Weidong
NEURAL PROCESSING LETTERS, 2024, 56 (04)
[28] Multi-Modal Emotion Recognition From Speech and Facial Expression Based on Deep Learning
Cai, Linqin
Dong, Jiangong
Wei, Min
2020 CHINESE AUTOMATION CONGRESS (CAC 2020), 2020, : 5726 - 5729
[29] A Multi-Modal Deep Learning Approach for Emotion Recognition
Shahzad, H. M.
Bhatti, Sohail Masood
Jaffar, Arfan
Rashid, Muhammad
INTELLIGENT AUTOMATION AND SOFT COMPUTING, 2023, 36 (02) : 1561 - 1570
[30] SERVER: Multi-modal Speech Emotion Recognition using Transformer-based and Vision-based Embeddings
Nhat Truong Pham
Duc Ngoc Minh Dang
Bich Ngoc Hong Pham
Sy Dzung Nguyen
PROCEEDINGS OF 2023 8TH INTERNATIONAL CONFERENCE ON INTELLIGENT INFORMATION TECHNOLOGY, ICIIT 2023, 2023, : 234 - 238
