A bimodal network based on Audio-Text-Interactional-Attention with ArcFace loss for speech emotion recognition

Cited by: 12
Authors
Tang, Yuwu [1 ,3 ]
Hu, Ying [1 ,3 ]
He, Liang [3 ,4 ]
Huang, Hao [2 ,3 ]
Affiliations
[1] Key Lab Signal Detect & Proc Xinjiang, Xinjiang, Peoples R China
[2] Key Lab Multilingual Informat Technol Xinjiang, Xinjiang, Peoples R China
[3] Xinjiang Univ, Sch Informat Sci & Engn, Urumqi, Peoples R China
[4] Tsinghua Univ, Dept Elect Engn, Tsinghua Natl Lab Informat Sci & Technol, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Bimodal network; Audio-Text-Interactional-Attention; ArcFace loss; Speech emotion recognition; FEATURES; MODALITIES; FACE;
DOI
10.1016/j.specom.2022.07.004
Chinese Library Classification (CLC)
O42 [Acoustics];
Subject Classification Codes
070206; 082403;
Abstract
Speech emotion recognition (SER) is an essential part of human-computer interaction, and SER systems have increasingly exploited multimodal information in recent years. This paper focuses on exploiting the acoustic and textual modalities for the SER task. We propose a bimodal network based on an Audio-Text-Interactional-Attention (ATIA) structure, which facilitates the interaction and fusion of emotionally salient information from the acoustic and textual modalities. We explore four different ATIA structures, verify their effectiveness, and select the best-performing one to build our bimodal network. Furthermore, our SER model adopts an additive angular margin loss, ArcFace loss, originally proposed for deep face recognition. ArcFace loss improves the discriminative power of features by focusing on the angles between the features and the class weights, and our visualization results demonstrate its effectiveness compared with the widely used Softmax loss. To the best of our knowledge, this is the first application of ArcFace loss to SER. On the IEMOCAP dataset, the bimodal network combined with ArcFace loss achieves a Weighted Accuracy (WA) of 72.8% and an Unweighted Accuracy (UA) of 62.5% for seven-class emotion classification, and a WA of 82.4% and a UA of 80.6% for four-class emotion classification.
Pages: 21-32
Number of pages: 12
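For illustration of the ArcFace loss described in the abstract, below is a minimal PyTorch sketch of the additive angular margin loss (Deng et al., 2019), which adds a margin to the angle between each feature vector and its target-class weight before softmax cross-entropy. The ArcFaceLoss class name and the default scale/margin values (s = 30.0, m = 0.5) are illustrative assumptions; the paper's exact settings are not given in this record.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceLoss(nn.Module):
    # Additive angular margin (ArcFace) loss. The scale and margin defaults
    # are illustrative assumptions, not the paper's reported hyper-parameters.
    def __init__(self, embedding_dim, num_classes, scale=30.0, margin=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, embedding_dim))
        nn.init.xavier_uniform_(self.weight)
        self.scale = scale
        self.margin = margin

    def forward(self, embeddings, labels):
        # Cosine of the angle between L2-normalised features and class weights.
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1.0 + 1e-7, 1.0 - 1e-7))
        # Add the angular margin m only to the target-class angle, then rescale
        # the logits and apply the usual softmax cross-entropy.
        one_hot = F.one_hot(labels, num_classes=self.weight.size(0)).float()
        logits = self.scale * torch.cos(theta + one_hot * self.margin)
        return F.cross_entropy(logits, labels)

A hypothetical usage for the seven-class setting reported on IEMOCAP would be criterion = ArcFaceLoss(embedding_dim=256, num_classes=7) followed by loss = criterion(fused_features, labels), where fused_features stands in for the output of the bimodal ATIA network and embedding_dim=256 is an assumed feature size.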