Speech Emotion Recognition with Multi-task Learning

Cited by: 40
Authors
Cai, Xingyu [1 ]
Yuan, Jiahong [1 ]
Zheng, Renjie [1 ]
Huang, Liang [1 ]
Church, Kenneth [1 ]
Affiliations
[1] Baidu Res, Sunnyvale, CA 94089 USA
Source
INTERSPEECH 2021 | 2021
Keywords
speech emotion recognition; multi-task learning; models
DOI
10.21437/Interspeech.2021-1852
Chinese Library Classification (CLC) codes
R36 [Pathology]; R76 [Otorhinolaryngology];
Discipline codes
100104 ; 100213 ;
Abstract
Speech emotion recognition (SER) classifies speech into emotion categories such as Happy, Angry, Sad, and Neutral. Recently, deep learning has been applied to the SER task. This paper proposes a multi-task learning (MTL) framework to simultaneously perform speech-to-text recognition and emotion classification, with an end-to-end deep neural model based on wav2vec-2.0. Experiments on the IEMOCAP benchmark show that the proposed method achieves state-of-the-art performance on the SER task. In addition, an ablation study establishes the effectiveness of the proposed MTL framework.
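The abstract describes joint training of speech-to-text recognition and emotion classification over a shared wav2vec-2.0 encoder. The sketch below illustrates such an MTL setup in PyTorch as a minimal, hypothetical instance: a small GRU stands in for wav2vec 2.0, and the head names, layer sizes, and loss weighting `alpha` are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskSER(nn.Module):
    """Joint ASR (CTC) + emotion classification over a shared encoder.

    Minimal sketch of the MTL idea: a small GRU stands in for the
    wav2vec-2.0 encoder; all dimensions are illustrative.
    """
    def __init__(self, feat_dim=40, hidden=64, vocab_size=30, n_emotions=4):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.ctc_head = nn.Linear(hidden, vocab_size)   # ASR branch (CTC)
        self.emo_head = nn.Linear(hidden, n_emotions)   # SER branch

    def forward(self, x):
        h, _ = self.encoder(x)                  # (B, T, H)
        asr_logits = self.ctc_head(h)           # per-frame token logits
        emo_logits = self.emo_head(h.mean(1))   # utterance-level mean pooling
        return asr_logits, emo_logits

def joint_loss(asr_logits, emo_logits, targets, target_lens, emo_labels, alpha=0.1):
    """Weighted sum of emotion cross-entropy and CTC loss (alpha illustrative)."""
    B, T = asr_logits.size(0), asr_logits.size(1)
    log_probs = asr_logits.log_softmax(-1).transpose(0, 1)   # (T, B, V) for CTC
    input_lens = torch.full((B,), T, dtype=torch.long)
    ctc = F.ctc_loss(log_probs, targets, input_lens, target_lens)
    ce = F.cross_entropy(emo_logits, emo_labels)
    return (1 - alpha) * ce + alpha * ctc

# Tiny usage demo with random features and labels
model = MultiTaskSER()
x = torch.randn(2, 50, 40)                       # 2 utterances, 50 frames, 40 features
asr_logits, emo_logits = model(x)
targets = torch.randint(1, 30, (2, 10))          # token ids (0 is the CTC blank)
target_lens = torch.full((2,), 10, dtype=torch.long)
emo_labels = torch.tensor([0, 3])                # e.g. Happy, Neutral
loss = joint_loss(asr_logits, emo_logits, targets, target_lens, emo_labels)
```

Sharing the encoder lets the ASR objective regularize the emotion branch, which is the core motivation for the MTL framing; in the sketch the CTC term is simply down-weighted by `alpha`.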
Pages: 4508-4512
Page count: 5