Vector learning representation for generalized speech emotion recognition

Cited by: 5
Authors
Singkul, Sattaya [1 ]
Woraratpanya, Kuntpong [1 ]
Affiliations
[1] King Mongkut's Inst Technol Ladkrabang, Fac Informat Technol, 1 Chalong Krung, Bangkok 10520, Thailand
Keywords
Speech emotion recognition; Residual squeeze excitation network; Normalized log mel spectrogram; Speech emotion verification; Verify-to-classify framework; Softmax with angular prototypical loss; Cross environment; End-to-end learning; FEATURES; CLASSIFICATION;
DOI
10.1016/j.heliyon.2022.e09196
Chinese Library Classification (CLC)
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences]
Discipline Codes
07; 0710; 09
Abstract
Speech emotion recognition (SER) plays an important role in global business today by improving service efficiency. In the SER literature, many techniques use deep learning to extract and learn features. Recently, we proposed end-to-end learning for a deep residual local feature learning block (DeepResLFLB). The advantages of end-to-end learning are low engineering effort and less hyperparameter tuning; nevertheless, this learning method is prone to overfitting. Therefore, this paper describes a "verify-to-classify" framework applied to learning vectors extracted from feature spaces of emotional information. The framework consists of two main parts: speech emotion learning and speech emotion recognition. Speech emotion learning comprises two steps, speech emotion verification enrolled training and prediction; a residual network (ResNet) with squeeze-excitation (SE) blocks serves as the core component of both steps, extracting emotional state vectors and building an emotion model during the enrolled training. The in-domain pre-trained weights of the trained emotion model are then transferred to the prediction step. As a result of speech emotion learning, the accepted model, validated by equal error rate (EER), is transferred to speech emotion recognition as out-domain pre-trained weights, ready for classification with a classical machine learning method. In this setting, a loss function suited to emotional vectors is essential; two loss functions are proposed: angular prototypical loss and softmax with angular prototypical loss. Experiments were conducted on two publicly available datasets, Emo-DB and RAVDESS, covering both high- and low-quality recording environments. The experimental results show that the proposed method significantly improves generalized performance and yields explainable emotion results when evaluated by standard metrics: EER, accuracy, precision, recall, and F1-score.
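To make the "softmax with angular prototypical" objective mentioned in the abstract concrete, the following is a minimal PyTorch sketch, assuming each training batch contains a pair of utterance embeddings per emotion class. The class name SoftmaxAngularProto, the pairing scheme, and the initial scale/bias values are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: angular prototypical loss plus a softmax classification loss.
# Assumed input: embeddings of shape (batch, 2, embed_dim), two utterances per
# emotion class, and integer emotion labels of shape (batch,).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftmaxAngularProto(nn.Module):
    def __init__(self, embed_dim, num_emotions, init_w=10.0, init_b=-5.0):
        super().__init__()
        self.w = nn.Parameter(torch.tensor(init_w))            # learnable cosine scale
        self.b = nn.Parameter(torch.tensor(init_b))            # learnable cosine bias
        self.classifier = nn.Linear(embed_dim, num_emotions)   # softmax branch

    def forward(self, embeddings, labels):
        anchor, query = embeddings[:, 0], embeddings[:, 1]
        w = torch.clamp(self.w, min=1e-6)                      # keep the scale positive

        # Angular prototypical term: scaled cosine similarity between every
        # query and every anchor (prototype), with the matching pair on the
        # diagonal as the target of a cross-entropy loss.
        cos = F.cosine_similarity(query.unsqueeze(1), anchor.unsqueeze(0), dim=-1)
        logits_ap = w * cos + self.b
        targets_ap = torch.arange(embeddings.size(0), device=embeddings.device)
        loss_ap = F.cross_entropy(logits_ap, targets_ap)

        # Softmax term: ordinary classification loss on all utterances in the batch.
        flat = embeddings.reshape(-1, embeddings.size(-1))
        loss_sm = F.cross_entropy(self.classifier(flat), labels.repeat_interleave(2))

        return loss_ap + loss_sm
```

In this sketch the angular term pulls each query embedding toward the prototype of its own emotion class and away from the others, while the softmax term keeps the embeddings directly classifiable; dropping the softmax branch leaves the plain angular prototypical loss.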
Pages: 13