A bimodal network based on Audio-Text-Interactional-Attention with ArcFace loss for speech emotion recognition

Cited by: 12
Authors
Tang, Yuwu [1 ,3 ]
Hu, Ying [1 ,3 ]
He, Liang [3 ,4 ]
Huang, Hao [2 ,3 ]
Affiliations
[1] Key Lab Signal Detect & Proc Xinjiang, Xinjiang, Peoples R China
[2] Key Lab Multilingual Informat Technol Xinjiang, Xinjiang, Peoples R China
[3] Xinjiang Univ, Sch Informat Sci & Engn, Urumqi, Peoples R China
[4] Tsinghua Univ, Dept Elect Engn, Tsinghua Natl Lab Informat Sci & Technol, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Bimodal network; Audio-Text-Interactional-Attention; ArcFace loss; Speech emotion recognition; FEATURES; MODALITIES; FACE;
DOI
10.1016/j.specom.2022.07.004
Chinese Library Classification (CLC)
O42 [Acoustics];
Subject Classification Codes
070206; 082403;
Abstract
Speech emotion recognition (SER) is an essential part of human-computer interaction, and SER systems have increasingly exploited multimodal information in recent years. This paper focuses on exploiting the acoustic and textual modalities for the SER task. We propose a bimodal network based on an Audio-Text-Interactional-Attention (ATIA) structure, which facilitates the interaction and fusion of emotionally salient information from the acoustic and textual modalities. We explore four different ATIA structures, verify their effectiveness, and select the best-performing one to build our bimodal network. Furthermore, our SER model adopts an additive angular margin loss, ArcFace loss, originally proposed for deep face recognition. ArcFace loss improves the discriminative power of features by focusing on the angles between the features and the class weights, and our visualization results demonstrate its effectiveness compared with the widely used Softmax loss. To the best of our knowledge, this is the first application of ArcFace loss to SER. On the IEMOCAP dataset, the bimodal network combined with ArcFace loss achieves a Weighted Accuracy (WA) of 72.8% and an Unweighted Accuracy (UA) of 62.5% for seven-class emotion classification, and a WA of 82.4% and a UA of 80.6% for four-class emotion classification.
Pages: 21-32
Number of pages: 12
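For illustration of the ArcFace loss described in the abstract, below is a minimal PyTorch sketch of the additive angular margin loss (Deng et al., 2019), which adds a margin to the angle between each feature vector and its target-class weight before softmax cross-entropy. The ArcFaceLoss class name and the default scale/margin values (s = 30.0, m = 0.5) are illustrative assumptions; the paper's exact settings are not given in this record.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceLoss(nn.Module):
    # Additive angular margin (ArcFace) loss. The scale and margin defaults
    # are illustrative assumptions, not the paper's reported hyper-parameters.
    def __init__(self, embedding_dim, num_classes, scale=30.0, margin=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, embedding_dim))
        nn.init.xavier_uniform_(self.weight)
        self.scale = scale
        self.margin = margin

    def forward(self, embeddings, labels):
        # Cosine of the angle between L2-normalised features and class weights.
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1.0 + 1e-7, 1.0 - 1e-7))
        # Add the angular margin m only to the target-class angle, then rescale
        # the logits and apply the usual softmax cross-entropy.
        one_hot = F.one_hot(labels, num_classes=self.weight.size(0)).float()
        logits = self.scale * torch.cos(theta + one_hot * self.margin)
        return F.cross_entropy(logits, labels)

A hypothetical usage for the seven-class setting reported on IEMOCAP would be criterion = ArcFaceLoss(embedding_dim=256, num_classes=7) followed by loss = criterion(fused_features, labels), where fused_features stands in for the output of the bimodal ATIA network and embedding_dim=256 is an assumed feature size.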