Speech emotion recognition model based on Bi-GRU and Focal Loss

Cited by: 64
Authors
Zhu, Zijiang [1 ,2 ]
Dai, Weihuang [3 ]
Hu, Yi [1 ]
Li, Junshan [1 ,2 ]
Affiliations
[1] Guangdong Univ Foreign Studies, South China Business Coll, Sch Informat Sci & Technol, Guangzhou 510545, Peoples R China
[2] Guangdong Univ Foreign Studies, South China Business Coll, Inst Intelligent Informat Proc, Guangzhou 510545, Peoples R China
[3] Guangdong Univ Foreign Studies, South China Business Coll, Human Resources Dept, Guangzhou 510545, Peoples R China
Keywords
Bi-GRU; Focal loss; Speech emotion recognition; Deep learning; CRNN; Neural networks; SMOTE
DOI
10.1016/j.patrec.2020.11.009
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
To address the problems of inconsistent sample duration and imbalanced sample categories in speech emotion corpora, this paper proposes a speech emotion recognition model based on Bi-GRU (Bidirectional Gated Recurrent Unit) and Focal Loss. The model improves on the CRNN (Convolutional Recurrent Neural Network) architecture: within the CRNN, the Bi-GRU is used to effectively lengthen short-duration speech samples, and the Focal Loss function addresses the classification difficulty caused by the imbalance of emotional categories among the samples. Weighted average recall (WAR), unweighted average recall (UAR), and the confusion matrix (CM) are used as evaluation metrics in experimental comparisons across different methods. The experimental results show that the proposed model improves recognition accuracy on the imbalanced IEMOCAP database, and that the performance gains are not due to the adjustment of model parameters or changes to the model topology. (c) 2020 Elsevier B.V. All rights reserved.
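The abstract pairs two mechanisms that are easy to sketch in code. Below is a minimal PyTorch sketch, not the authors' implementation: the layer sizes, the 40-band log-mel input, and the 4-class IEMOCAP-style setup are all assumptions. It shows a CRNN whose recurrent stage is a bidirectional GRU and whose training criterion is focal loss, which down-weights easy, majority-class examples so the rare emotion classes contribute more to the gradient.

```python
# Hedged sketch of a CRNN with a Bi-GRU recurrent stage trained with
# focal loss. Architecture details are illustrative assumptions, not
# the configuration reported in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FocalLoss(nn.Module):
    """Focal loss: FL(p_t) = -(1 - p_t)^gamma * log(p_t).
    gamma > 0 shrinks the loss of well-classified (easy) examples;
    a per-class alpha weight vector could be added for further rebalancing."""

    def __init__(self, gamma: float = 2.0):
        super().__init__()
        self.gamma = gamma

    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        log_pt = F.log_softmax(logits, dim=-1).gather(1, targets.unsqueeze(1)).squeeze(1)
        pt = log_pt.exp()
        return (-((1.0 - pt) ** self.gamma) * log_pt).mean()


class CRNNBiGRU(nn.Module):
    """Conv front end over spectrogram frames, Bi-GRU over time, linear classifier."""

    def __init__(self, n_mels: int = 40, n_classes: int = 4, hidden: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d((2, 1)),  # pool frequency only; keep time resolution
        )
        self.gru = nn.GRU(32 * (n_mels // 2), hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, n_mels, frames) log-mel spectrogram
        h = self.conv(x)                      # (batch, 32, n_mels//2, frames)
        h = h.permute(0, 3, 1, 2).flatten(2)  # (batch, frames, features)
        out, _ = self.gru(h)                  # Bi-GRU reads the frame sequence both ways
        return self.fc(out.mean(dim=1))       # mean-pool over time, then classify


model = CRNNBiGRU()
criterion = FocalLoss(gamma=2.0)
logits = model(torch.randn(8, 1, 40, 120))   # dummy batch of 8 utterances
loss = criterion(logits, torch.randint(0, 4, (8,)))
loss.backward()
```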
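The two recall metrics named in the abstract are also worth pinning down, since they diverge exactly when the data are imbalanced. A short NumPy sketch with hypothetical label arrays: WAR is frequency-weighted (plain accuracy), while UAR averages per-class recalls so minority emotions count equally, which is why UAR is the stricter metric on a corpus like IEMOCAP.

```python
# Sketch of WAR (weighted average recall, i.e. overall accuracy) and
# UAR (unweighted average recall, the mean of per-class recalls).
import numpy as np


def war_uar(y_true: np.ndarray, y_pred: np.ndarray) -> tuple:
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    war = float(np.mean(y_pred == y_true))  # weighted by class frequency
    uar = float(np.mean(recalls))           # each class weighted equally
    return war, uar


y_true = np.array([0, 0, 0, 0, 1, 1, 2, 3])  # class 0 dominates
y_pred = np.array([0, 0, 0, 0, 1, 0, 2, 0])
print(war_uar(y_true, y_pred))  # WAR 0.75, UAR (1.0 + 0.5 + 1.0 + 0.0)/4 = 0.625
```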
Pages: 358-365
Number of pages: 8