Research and Implementation of Speech Emotion Recognition Based on CGRU Model

Cited by: 2
Authors
Zheng Y. [1]
Chen J.-N. [1]
Wu F. [1]
Fu B. [1]
Affiliation
[1] School of Information Science & Engineering, Northeastern University, Shenyang
Source
Dongbei Daxue Xuebao/Journal of Northeastern University | 2020, Vol. 41, No. 12
Keywords
CGRU model; Data augmentation; Mel-frequency cepstral coefficients; Random forest; Speech emotion recognition
DOI
10.12068/j.issn.1005-3026.2020.12.002
Abstract
Speech emotion recognition is an important research direction in affective computing and human-computer interaction. Deep neural networks are now widely used to extract emotional features from speech, but which network architecture to use and how to alleviate model overfitting require further study. To address these problems, a CGRU model was proposed that combines a one-dimensional convolutional neural network (CNN) with a gated recurrent unit (GRU). Low-order and high-order emotional features were extracted from the MFCC features of the original speech signal and selected with a random forest, achieving recognition accuracies of 79%, 69%, and 75% on three common emotion corpora: EMODB, SAVEE, and RAVDESS. Data augmentation, which enlarged the sample set by adding Gaussian noise and changing the speed, further improved the recognition accuracy. The model's applicability in the real world was verified through an online recognition system. © 2020, Editorial Department of Journal of Northeastern University. All rights reserved.
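The abstract only outlines the pipeline, so a minimal Python sketch of one plausible reading follows: Gaussian-noise and speed-change augmentation of the waveform, MFCC extraction, a 1-D CNN for low-order local features, and a GRU for high-order temporal features. All hyperparameters (layer widths, kernel size, the 0.005 noise scale, the 0.9/1.1 stretch rates, a 7-class output) are illustrative assumptions, not the authors' reported settings, and the paper's random-forest feature-selection step is omitted.

# A rough sketch of the pipeline described in the abstract, assuming
# PyTorch, librosa, and hypothetical hyperparameters throughout.
import numpy as np
import librosa
import torch
import torch.nn as nn

def augment(y, sr):
    # Augmentations named in the abstract: additive Gaussian noise and
    # speed change. Noise scale 0.005 and rates 0.9/1.1 are guesses.
    noisy = y + np.random.normal(0.0, 0.005, size=y.shape).astype(y.dtype)
    slower = librosa.effects.time_stretch(y, rate=0.9)
    faster = librosa.effects.time_stretch(y, rate=1.1)
    return [noisy, slower, faster]

class CGRU(nn.Module):
    # 1-D CNN over MFCC frames (low-order, local features) feeding a GRU
    # (high-order, temporal features) and a linear emotion classifier.
    def __init__(self, n_mfcc=40, n_classes=7):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mfcc, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(64, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.gru = nn.GRU(input_size=128, hidden_size=128, batch_first=True)
        self.fc = nn.Linear(128, n_classes)

    def forward(self, mfcc):                 # mfcc: (batch, n_mfcc, frames)
        h = self.conv(mfcc)                  # (batch, 128, frames // 4)
        h = h.transpose(1, 2)                # (batch, frames // 4, 128)
        _, last = self.gru(h)                # final hidden state (1, batch, 128)
        return self.fc(last.squeeze(0))      # emotion logits (batch, n_classes)

# Demo on a 3-second synthetic "utterance" standing in for corpus audio.
sr = 16000
y = np.random.randn(3 * sr).astype(np.float32)
variants = [y] + augment(y, sr)                       # original + 3 augmented copies
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)    # (40, frames)
batch = torch.from_numpy(mfcc).float().unsqueeze(0)   # (1, 40, frames)
print(CGRU()(batch).shape)                            # torch.Size([1, 7])

Taking only the GRU's final hidden state keeps the classifier small; in the paper, a random forest selects among the extracted features before classification, a step this sketch leaves out.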
Pages: 1680-1685
Page count: 5
References
11 records in total
[1] Picard R W., Affective computing, pp. 14-16, (1997)
[2] Kim Y, Lee H, Provost E M., Deep learning for robust feature generation in audiovisual emotion recognition, Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3687-3691, (2013)
[3] Deng J, Zhang Z, Marchi E, et al., Sparse autoencoder-based feature transfer learning for speech emotion recognition, 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction, pp. 511-516, (2013)
[4] Lee J, Tashev I., High-level feature representation using recurrent neural network for speech emotion recognition, Interspeech, 5(1), pp. 10-13, (2015)
[5] LeCun Y, Bottou L, Bengio Y, et al., Gradient-based learning applied to document recognition, Proceedings of the IEEE, 86(11), pp. 2278-2324, (1998)
[6] LeCun Y, Bengio Y., Convolutional networks for images, speech, and time series, The Handbook of Brain Theory and Neural Networks, pp. 255-257, (1995)
[7] Likitha M S, Gupta S R R, Hasitha K, et al., Speech based human emotion recognition using MFCC, 2017 International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET), pp. 2257-2260, (2017)
[8] Burkhardt F, Paeschke A, Rolfes M, et al., A database of German emotional speech, Proceedings of Interspeech 2005, pp. 1517-1520, (2005)
[9] Jackson P, Haq S., Surrey Audio-Visual Expressed Emotion (SAVEE) database
[10] Livingstone S R, Russo F A., The Ryerson audio-visual database of emotional speech and song (RAVDESS): a dynamic, multimodal set of facial and vocal expressions in North American English, PLOS ONE, 13(5), e0196391, (2018)