A bimodal network based on Audio-Text-Interactional-Attention with ArcFace loss for speech emotion recognition

Cited by: 15
Authors
Tang, Yuwu [1,3]
Hu, Ying [1,3]
He, Liang [3,4]
Huang, Hao [2,3]
Affiliations
[1] Key Lab Signal Detect & Proc Xinjiang, Xinjiang, Peoples R China
[2] Key Lab Multilingual Informat Technol Xinjiang, Xinjiang, Peoples R China
[3] Xinjiang Univ, Sch Informat Sci & Engn, Urumqi, Peoples R China
[4] Tsinghua Univ, Dept Elect Engn, Tsinghua Natl Lab Informat Sci & Technol, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Bimodal network; Audio-Text-Interactional-Attention; ArcFace loss; Speech emotion recognition; FEATURES; MODALITIES; FACE;
DOI
10.1016/j.specom.2022.07.004
Chinese Library Classification (CLC) code
O42 [Acoustics];
Discipline classification codes
070206; 082403;
Abstract
Speech emotion recognition (SER) is an essential part of human-computer interaction, and in recent years SER systems have increasingly exploited multimodal information. This paper focuses on exploiting the acoustic and textual modalities for the SER task. We propose a bimodal network based on an Audio-Text-Interactional-Attention (ATIA) structure, which facilitates the interaction and fusion of emotionally salient information across the acoustic and textual modalities. We explored four different ATIA structures, verified their effectiveness, and selected the best-performing one to build our bimodal network. Furthermore, our SER model adopts an additive angular margin loss, named ArcFace loss, originally applied in the deep face recognition field. ArcFace loss improves the discriminative power of features by focusing on the angles between the features and the class weights; compared with the widely used Softmax loss, our visualization results demonstrate its effectiveness. To the best of our knowledge, this is the first time ArcFace loss has been applied to SER. Finally, on the IEMOCAP dataset, the bimodal network combined with ArcFace loss achieved 72.8% Weighted Accuracy (WA) and 62.5% Unweighted Accuracy (UA) for seven-class emotion classification, and 82.4% WA and 80.6% UA for four-class emotion classification.
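As a hedged illustration of the loss described above, the following PyTorch-style sketch shows a generic additive angular margin (ArcFace) head: class weights and features are L2-normalized so the logits depend only on their angles, and a margin is added to the target-class angle before scaled Softmax cross-entropy. This is a minimal sketch, not the authors' implementation; the scale s, margin m, and feature dimension are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceLoss(nn.Module):
    # Generic ArcFace head; s (scale) and m (angular margin) are assumed
    # values, not taken from the paper.
    def __init__(self, feat_dim, num_classes, s=30.0, m=0.5):
        super().__init__()
        self.s, self.m = s, m
        # One weight vector per emotion class, compared to features by angle only.
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, features, labels):
        # Cosine of the angle between normalized features and class weights.
        cosine = F.linear(F.normalize(features), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1.0 + 1e-7, 1.0 - 1e-7))
        # Add the angular margin m only to the target-class angle.
        one_hot = F.one_hot(labels, num_classes=self.weight.size(0)).float()
        logits = self.s * torch.cos(theta + self.m * one_hot)
        return F.cross_entropy(logits, labels)

The abstract also states that the ATIA structure lets the acoustic and textual streams exchange emotionally salient information. The exact wiring of the four ATIA variants is not given in this record, so the sketch below (reusing the imports above) only illustrates the general idea of bidirectional audio-text cross-attention followed by pooling and concatenation; the module name, layer sizes, and fusion step are assumptions.

class AudioTextCrossAttention(nn.Module):
    # Hypothetical interaction block; the paper's ATIA variants may differ.
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.audio_to_text = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.text_to_audio = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, audio_feats, text_feats):
        # Each modality queries the other to pick up emotionally salient context.
        a_ctx, _ = self.audio_to_text(audio_feats, text_feats, text_feats)
        t_ctx, _ = self.text_to_audio(text_feats, audio_feats, audio_feats)
        # Fuse by mean-pooling each enriched stream and concatenating.
        return torch.cat([a_ctx.mean(dim=1), t_ctx.mean(dim=1)], dim=-1)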
Pages: 21-32
Page count: 12