Multi-modal speech emotion recognition using self-attention mechanism and multi-scale fusion framework

Cited by: 36
Authors
Liu, Yang [1]; Sun, Haoqin [1]; Guan, Wenbo [1]; Xia, Yuqi [1]; Zhao, Zhen [1]
Affiliations
[1] Qingdao Univ Sci & Technol, Sch Informat Sci & Technol, Qingdao 266061, Peoples R China
Keywords
Speech emotion recognition; Utterance-level contextual information; Multi-scale fusion framework; Neural networks
DOI
10.1016/j.specom.2022.02.006
CLC number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
Accurately recognizing emotion from speech is a necessary yet challenging task due to the variability in speech and emotion. In this paper, a novel method combining a self-attention mechanism with a multi-scale fusion framework is proposed for multi-modal speech emotion recognition (SER) using speech and text information. A self-attentional bidirectional contextual LSTM (bc-LSTM) is proposed to learn context-sensitive dependencies from speech. Specifically, the BLSTM layer learns long-term dependencies and utterance-level contextual information, while the multi-head self-attention layer makes the model focus on the features most relevant to emotion. A self-attentional multi-channel CNN (MCNN), which takes advantage of static and dynamic channels, is applied to learn general and thematic features from text. Finally, a multi-scale fusion strategy, comprising feature-level fusion and decision-level fusion, is applied to improve the overall performance. Experimental results on the benchmark dataset IEMOCAP demonstrate that our method gains an absolute improvement of 1.48% and 3.00% over state-of-the-art strategies in terms of weighted accuracy (WA) and unweighted accuracy (UA), respectively.
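As a rough illustration of the two-branch architecture the abstract describes, the PyTorch sketch below wires a BLSTM-plus-multi-head-self-attention speech branch and a static/dynamic multi-channel CNN text branch into a feature-level fusion head, then averages the branch and fused predictions as a stand-in for decision-level fusion. All layer sizes (88-dimensional acoustic features, 300-dimensional embeddings, four emotion classes), the mean pooling, and the simple logit averaging are assumptions made for the sake of a runnable example; they are not taken from the paper.

# Minimal sketch of the architecture described in the abstract.
# NOTE: all dimensions, the 4-class label set, and the logit-averaging
# decision-level fusion are illustrative assumptions, not the paper's values.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpeechBranch(nn.Module):
    """Self-attentional bc-LSTM: BLSTM for utterance-level context, then multi-head self-attention."""

    def __init__(self, feat_dim=88, hidden=128, heads=4, n_classes=4):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(embed_dim=2 * hidden, num_heads=heads, batch_first=True)
        self.cls = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                       # x: (batch, frames, feat_dim) acoustic features
        h, _ = self.blstm(x)                    # bidirectional, long-term context
        a, _ = self.attn(h, h, h)               # emphasise emotion-relevant frames
        pooled = a.mean(dim=1)                  # utterance-level representation
        return pooled, self.cls(pooled)


class TextBranch(nn.Module):
    """Self-attentional multi-channel CNN over static (frozen) and dynamic (trainable) embeddings."""

    def __init__(self, vocab=10000, emb_dim=300, n_filters=100, kernel_sizes=(3, 5, 7), n_classes=4):
        super().__init__()
        self.static_emb = nn.Embedding(vocab, emb_dim)
        self.static_emb.weight.requires_grad = False    # static channel stays fixed
        self.dynamic_emb = nn.Embedding(vocab, emb_dim)  # dynamic channel is fine-tuned
        self.convs = nn.ModuleList(
            nn.Conv1d(2 * emb_dim, n_filters, k, padding=k // 2) for k in kernel_sizes
        )
        feat = n_filters * len(kernel_sizes)
        self.attn = nn.MultiheadAttention(embed_dim=feat, num_heads=4, batch_first=True)
        self.cls = nn.Linear(feat, n_classes)

    def forward(self, tokens):                  # tokens: (batch, seq_len) word ids
        e = torch.cat([self.static_emb(tokens), self.dynamic_emb(tokens)], dim=-1)
        e = e.transpose(1, 2)                   # (batch, channels, seq_len) for Conv1d
        c = torch.cat([F.relu(conv(e)) for conv in self.convs], dim=1).transpose(1, 2)
        a, _ = self.attn(c, c, c)               # emphasise emotion-relevant words
        pooled = a.mean(dim=1)
        return pooled, self.cls(pooled)


class MultiScaleFusion(nn.Module):
    """Feature-level fusion (concatenated branch features) plus decision-level fusion (averaged logits)."""

    def __init__(self, n_classes=4):
        super().__init__()
        self.speech = SpeechBranch(n_classes=n_classes)   # pooled dim 256
        self.text = TextBranch(n_classes=n_classes)       # pooled dim 300
        self.fused_cls = nn.Linear(256 + 300, n_classes)

    def forward(self, audio_feats, tokens):
        s_feat, s_logits = self.speech(audio_feats)
        t_feat, t_logits = self.text(tokens)
        fused_logits = self.fused_cls(torch.cat([s_feat, t_feat], dim=-1))  # feature-level fusion
        return (fused_logits + s_logits + t_logits) / 3                     # decision-level fusion (assumed averaging)


# Example usage with random inputs: 8 utterances, 120 frames of 88-d features, 40 tokens each.
model = MultiScaleFusion()
logits = model(torch.randn(8, 120, 88), torch.randint(0, 10000, (8, 40)))  # -> (8, 4)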
Pages: 1-9 (9 pages)
References (47 in total)
[1] [Anonymous], 2014, EMPIRICAL EVALUATION.
[2] Barrett, L.F. Solving the emotion paradox: Categorization and the experience of emotion. Personality and Social Psychology Review, 2006, 10(1): 20-46.
[3] Busso, C., Bulut, M., Lee, C.-C., Kazemzadeh, A., Mower, E., Kim, S., Chang, J.N., Lee, S., Narayanan, S.S. IEMOCAP: interactive emotional dyadic motion capture database. Language Resources and Evaluation, 2008, 42(4): 335-359.
[4] Chen, M., 2020, MULTISCALE FUSION FR, p. 374.
[5] Cho, J., Pappagari, R., Kulkarni, P., Villalba, J., Carmiel, Y., Dehak, N. Deep neural networks for emotion recognition combining audio and transcripts. Proc. 19th Annual Conference of the International Speech Communication Association (Interspeech 2018), pp. 247-251.
[6] Demircan, S., 2016, Neural Computing and Applications.
[7] Devlin, J., 2019. Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), Vol. 1, p. 4171.
[8] Eyben, F., 2010. Proc. 18th ACM International Conference on Multimedia, p. 1459.
[9] Eyben, F., Scherer, K.R., Schuller, B.W., Sundberg, J., Andre, E., Busso, C., Devillers, L.Y., Epps, J., Laukka, P., Narayanan, S.S., Truong, K.P. The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing. IEEE Transactions on Affective Computing, 2016, 7(2): 190-202.
[10] Farhoudi, Z., Setayeshi, S. Fusion of deep learning features with mixture of brain emotional learning for audio-visual emotion recognition. Speech Communication, 2021, 127: 92-103.