Multi-modal speech emotion recognition using self-attention mechanism and multi-scale fusion framework

Cited by: 36
Authors
Liu, Yang [1]
Sun, Haoqin [1]
Guan, Wenbo [1]
Xia, Yuqi [1]
Zhao, Zhen [1]
Affiliation
[1] Qingdao Univ Sci & Technol, Sch Informat Sci & Technol, Qingdao 266061, Peoples R China
Keywords
Speech emotion recognition; Utterance-level contextual information; Multi-scale fusion framework; Neural networks
DOI
10.1016/j.specom.2022.02.006
Chinese Library Classification
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
Accurately recognizing emotion from speech is a necessary yet challenging task due to the variability of speech and emotion. In this paper, a novel method combining a self-attention mechanism and a multi-scale fusion framework is proposed for multi-modal speech emotion recognition (SER) using speech and text information. A self-attentional bidirectional contextual LSTM (bc-LSTM) is proposed to learn context-sensitive dependencies from speech. Specifically, the BLSTM layer learns long-term dependencies and utterance-level contextual information, while the multi-head self-attention layer makes the model focus on the features most relevant to emotion. A self-attentional multi-channel CNN (MCNN), which takes advantage of static and dynamic channels, is applied to learn general and thematic features from text. Finally, a multi-scale fusion strategy, comprising feature-level fusion and decision-level fusion, is applied to improve overall performance. Experimental results on the benchmark dataset IEMOCAP demonstrate that our method achieves absolute improvements of 1.48% and 3.00% over state-of-the-art strategies in terms of weighted accuracy (WA) and unweighted accuracy (UA), respectively.
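Below is a minimal PyTorch sketch of the two-branch pipeline the abstract describes: a BiLSTM-plus-multi-head-self-attention speech branch (bc-LSTM), a static/dynamic two-channel text CNN (MCNN), and a multi-scale fusion stage that concatenates branch embeddings (feature level) and averages class posteriors (decision level). All hyperparameters (feature dimension, hidden size, head count, kernel sizes, four emotion classes) and the averaging rule are illustrative assumptions, not values reported in the paper.

```python
# Hedged sketch of the architecture in the abstract; dimensions are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechBranch(nn.Module):
    """bc-LSTM: BiLSTM over frame-level acoustic features + multi-head self-attention."""
    def __init__(self, feat_dim=78, hidden=128, heads=4):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)

    def forward(self, x):              # x: (B, T, feat_dim)
        h, _ = self.blstm(x)           # (B, T, 2*hidden): utterance-level context
        a, _ = self.attn(h, h, h)      # self-attention re-weights emotion-salient frames
        return a.mean(dim=1)           # (B, 2*hidden) utterance embedding

class TextBranch(nn.Module):
    """MCNN: static (frozen) and dynamic (trainable) embedding channels + 1-D convs."""
    def __init__(self, vocab=10000, emb=300, filters=100, sizes=(3, 4, 5)):
        super().__init__()
        self.static = nn.Embedding(vocab, emb)   # stands in for frozen pre-trained vectors
        self.static.weight.requires_grad = False
        self.dynamic = nn.Embedding(vocab, emb)  # fine-tuned copy
        self.convs = nn.ModuleList(nn.Conv1d(2 * emb, filters, k) for k in sizes)

    def forward(self, tokens):                   # tokens: (B, L) int ids
        e = torch.cat([self.static(tokens), self.dynamic(tokens)], dim=-1)
        e = e.transpose(1, 2)                    # (B, 2*emb, L) for Conv1d
        pooled = [F.relu(c(e)).max(dim=2).values for c in self.convs]
        return torch.cat(pooled, dim=1)          # (B, filters * len(sizes))

class MultiScaleFusion(nn.Module):
    """Feature-level fusion (concatenation) plus decision-level fusion
    (averaging the per-branch and fused class posteriors)."""
    def __init__(self, n_classes=4):
        super().__init__()
        self.speech = SpeechBranch()
        self.text = TextBranch()
        self.cls_s = nn.Linear(256, n_classes)
        self.cls_t = nn.Linear(300, n_classes)
        self.cls_f = nn.Linear(256 + 300, n_classes)

    def forward(self, audio_feats, tokens):
        s = self.speech(audio_feats)
        t = self.text(tokens)
        p_s = self.cls_s(s).softmax(-1)                        # speech-only decision
        p_t = self.cls_t(t).softmax(-1)                        # text-only decision
        p_f = self.cls_f(torch.cat([s, t], dim=1)).softmax(-1) # feature-level fusion
        return (p_s + p_t + p_f) / 3                           # decision-level average

model = MultiScaleFusion()
probs = model(torch.randn(2, 120, 78), torch.randint(0, 10000, (2, 30)))
print(probs.shape)  # torch.Size([2, 4])
```

In practice the static channel would be initialized from pre-trained word vectors and kept frozen while the dynamic channel is fine-tuned; both are randomly initialized here only so the sketch runs standalone.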
Pages: 1-9 (9 pages)