Multi-modal speech emotion recognition using self-attention mechanism and multi-scale fusion framework

Cited by: 36
Authors
Liu, Yang [1]
Sun, Haoqin [1]
Guan, Wenbo [1]
Xia, Yuqi [1]
Zhao, Zhen [1]
Affiliation
[1] Qingdao Univ Sci & Technol, Sch Informat Sci & Technol, Qingdao 266061, Peoples R China
Keywords
Speech emotion recognition; Utterance-level contextual information; Multi-scale fusion framework; Neural networks
DOI
10.1016/j.specom.2022.02.006
Chinese Library Classification
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
Accurately recognizing emotion from speech is a necessary yet challenging task due to the variability of speech and emotion. In this paper, a novel method combining a self-attention mechanism and a multi-scale fusion framework is proposed for multi-modal speech emotion recognition (SER) using speech and text information. A self-attentional bidirectional contextual LSTM (bc-LSTM) is proposed to learn context-sensitive dependencies from speech. Specifically, the BLSTM layer learns long-term dependencies and utterance-level contextual information, while the multi-head self-attention layer makes the model focus on the features most relevant to emotion. A self-attentional multi-channel CNN (MCNN), which takes advantage of static and dynamic channels, is applied to learn general and thematic features from text. Finally, a multi-scale fusion strategy, comprising feature-level fusion and decision-level fusion, is applied to improve overall performance. Experimental results on the benchmark dataset IEMOCAP demonstrate that our method achieves absolute improvements of 1.48% and 3.00% over state-of-the-art strategies in terms of weighted accuracy (WA) and unweighted accuracy (UA), respectively.
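Below is a minimal PyTorch sketch of the two-branch pipeline the abstract describes: a BiLSTM-plus-multi-head-self-attention speech branch (bc-LSTM), a static/dynamic two-channel text CNN (MCNN), and a multi-scale fusion stage that concatenates branch embeddings (feature level) and averages class posteriors (decision level). All hyperparameters (feature dimension, hidden size, head count, kernel sizes, four emotion classes) and the averaging rule are illustrative assumptions, not values reported in the paper.

```python
# Hedged sketch of the architecture in the abstract; dimensions are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechBranch(nn.Module):
    """bc-LSTM: BiLSTM over frame-level acoustic features + multi-head self-attention."""
    def __init__(self, feat_dim=78, hidden=128, heads=4):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)

    def forward(self, x):              # x: (B, T, feat_dim)
        h, _ = self.blstm(x)           # (B, T, 2*hidden): utterance-level context
        a, _ = self.attn(h, h, h)      # self-attention re-weights emotion-salient frames
        return a.mean(dim=1)           # (B, 2*hidden) utterance embedding

class TextBranch(nn.Module):
    """MCNN: static (frozen) and dynamic (trainable) embedding channels + 1-D convs."""
    def __init__(self, vocab=10000, emb=300, filters=100, sizes=(3, 4, 5)):
        super().__init__()
        self.static = nn.Embedding(vocab, emb)   # stands in for frozen pre-trained vectors
        self.static.weight.requires_grad = False
        self.dynamic = nn.Embedding(vocab, emb)  # fine-tuned copy
        self.convs = nn.ModuleList(nn.Conv1d(2 * emb, filters, k) for k in sizes)

    def forward(self, tokens):                   # tokens: (B, L) int ids
        e = torch.cat([self.static(tokens), self.dynamic(tokens)], dim=-1)
        e = e.transpose(1, 2)                    # (B, 2*emb, L) for Conv1d
        pooled = [F.relu(c(e)).max(dim=2).values for c in self.convs]
        return torch.cat(pooled, dim=1)          # (B, filters * len(sizes))

class MultiScaleFusion(nn.Module):
    """Feature-level fusion (concatenation) plus decision-level fusion
    (averaging the per-branch and fused class posteriors)."""
    def __init__(self, n_classes=4):
        super().__init__()
        self.speech = SpeechBranch()
        self.text = TextBranch()
        self.cls_s = nn.Linear(256, n_classes)
        self.cls_t = nn.Linear(300, n_classes)
        self.cls_f = nn.Linear(256 + 300, n_classes)

    def forward(self, audio_feats, tokens):
        s = self.speech(audio_feats)
        t = self.text(tokens)
        p_s = self.cls_s(s).softmax(-1)                        # speech-only decision
        p_t = self.cls_t(t).softmax(-1)                        # text-only decision
        p_f = self.cls_f(torch.cat([s, t], dim=1)).softmax(-1) # feature-level fusion
        return (p_s + p_t + p_f) / 3                           # decision-level average

model = MultiScaleFusion()
probs = model(torch.randn(2, 120, 78), torch.randint(0, 10000, (2, 30)))
print(probs.shape)  # torch.Size([2, 4])
```

In practice the static channel would be initialized from pre-trained word vectors and kept frozen while the dynamic channel is fine-tuned; both are randomly initialized here only so the sketch runs standalone.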
Pages: 1-9 (9 pages)