Speech Emotion Recognition with Multi-task Learning

Cited by: 40
Authors
Cai, Xingyu [1 ]
Yuan, Jiahong [1 ]
Zheng, Renjie [1 ]
Huang, Liang [1 ]
Church, Kenneth [1 ]
Affiliations
[1] Baidu Res, Sunnyvale, CA 94089 USA
Source
INTERSPEECH 2021 | 2021
Keywords
speech emotion recognition; multi-task learning; models
DOI
10.21437/Interspeech.2021-1852
Chinese Library Classification (CLC) codes
R36 [Pathology]; R76 [Otorhinolaryngology];
Discipline codes
100104 ; 100213 ;
Abstract
Speech emotion recognition (SER) classifies speech into emotion categories such as Happy, Angry, Sad, and Neutral. Recently, deep learning has been applied to the SER task. This paper proposes a multi-task learning (MTL) framework to simultaneously perform speech-to-text recognition and emotion classification, with an end-to-end deep neural model based on wav2vec-2.0. Experiments on the IEMOCAP benchmark show that the proposed method achieves state-of-the-art performance on the SER task. In addition, an ablation study establishes the effectiveness of the proposed MTL framework.
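The abstract describes joint training of speech-to-text recognition and emotion classification over a shared wav2vec-2.0 encoder. The sketch below illustrates such an MTL setup in PyTorch as a minimal, hypothetical instance: a small GRU stands in for wav2vec 2.0, and the head names, layer sizes, and loss weighting `alpha` are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskSER(nn.Module):
    """Joint ASR (CTC) + emotion classification over a shared encoder.

    Minimal sketch of the MTL idea: a small GRU stands in for the
    wav2vec-2.0 encoder; all dimensions are illustrative.
    """
    def __init__(self, feat_dim=40, hidden=64, vocab_size=30, n_emotions=4):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.ctc_head = nn.Linear(hidden, vocab_size)   # ASR branch (CTC)
        self.emo_head = nn.Linear(hidden, n_emotions)   # SER branch

    def forward(self, x):
        h, _ = self.encoder(x)                  # (B, T, H)
        asr_logits = self.ctc_head(h)           # per-frame token logits
        emo_logits = self.emo_head(h.mean(1))   # utterance-level mean pooling
        return asr_logits, emo_logits

def joint_loss(asr_logits, emo_logits, targets, target_lens, emo_labels, alpha=0.1):
    """Weighted sum of emotion cross-entropy and CTC loss (alpha illustrative)."""
    B, T = asr_logits.size(0), asr_logits.size(1)
    log_probs = asr_logits.log_softmax(-1).transpose(0, 1)   # (T, B, V) for CTC
    input_lens = torch.full((B,), T, dtype=torch.long)
    ctc = F.ctc_loss(log_probs, targets, input_lens, target_lens)
    ce = F.cross_entropy(emo_logits, emo_labels)
    return (1 - alpha) * ce + alpha * ctc

# Tiny usage demo with random features and labels
model = MultiTaskSER()
x = torch.randn(2, 50, 40)                       # 2 utterances, 50 frames, 40 features
asr_logits, emo_logits = model(x)
targets = torch.randint(1, 30, (2, 10))          # token ids (0 is the CTC blank)
target_lens = torch.full((2,), 10, dtype=torch.long)
emo_labels = torch.tensor([0, 3])                # e.g. Happy, Neutral
loss = joint_loss(asr_logits, emo_logits, targets, target_lens, emo_labels)
```

Sharing the encoder lets the ASR objective regularize the emotion branch, which is the core motivation for the MTL framing; in the sketch the CTC term is simply down-weighted by `alpha`.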
Pages: 4508-4512
Page count: 5