Speech Emotion Recognition with Multi-task Learning

Times Cited: 56
Authors
Cai, Xingyu [1 ]
Yuan, Jiahong [1 ]
Zheng, Renjie [1 ]
Huang, Liang [1 ]
Church, Kenneth [1 ]
Affiliations
[1] Baidu Research, Sunnyvale, CA 94089, USA
Source
INTERSPEECH 2021 | 2021
Keywords
speech emotion recognition; multi-task learning
DOI
10.21437/Interspeech.2021-1852
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject Classification Codes
100104; 100213
Abstract
Speech emotion recognition (SER) classifies speech into emotion categories such as Happy, Angry, Sad, and Neutral. Recently, deep learning has been applied to the SER task. This paper proposes a multi-task learning (MTL) framework to simultaneously perform speech-to-text recognition and emotion classification, with an end-to-end deep neural model based on wav2vec-2.0. Experiments on the IEMOCAP benchmark show that the proposed method achieves state-of-the-art performance on the SER task. In addition, an ablation study establishes the effectiveness of the proposed MTL framework.
Pages: 4508-4512
Number of Pages: 5
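To make the architecture described in the abstract concrete, below is a minimal sketch (not the authors' released code) of a joint ASR/emotion multi-task model on top of wav2vec 2.0, assuming the HuggingFace transformers implementation of the encoder; the vocabulary size, number of emotion classes, pooling choice, and loss weighting are illustrative placeholders rather than the paper's settings.

import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class MultiTaskSER(nn.Module):
    """Sketch: a shared wav2vec-2.0 encoder with two heads, one for
    speech-to-text (CTC) and one for utterance-level emotion classification.
    Sizes and the pretrained checkpoint name are assumptions."""

    def __init__(self, vocab_size=32, num_emotions=4,
                 pretrained="facebook/wav2vec2-base"):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(pretrained)
        hidden = self.encoder.config.hidden_size
        self.ctc_head = nn.Linear(hidden, vocab_size)    # frame-level logits for the ASR (CTC) task
        self.emo_head = nn.Linear(hidden, num_emotions)  # utterance-level emotion logits

    def forward(self, input_values):
        states = self.encoder(input_values).last_hidden_state  # (batch, frames, hidden)
        ctc_logits = self.ctc_head(states)                     # per-frame logits for CTC decoding
        emo_logits = self.emo_head(states.mean(dim=1))         # mean-pool over time, then classify
        return ctc_logits, emo_logits

# Joint training objective (illustrative): total loss = emotion cross-entropy
# + alpha * CTC loss, where alpha is a task-weighting hyperparameter.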