Speech emotion recognition using data augmentation method by cycle-generative adversarial networks

Cited by: 0
Authors
Arash Shilandari
Hossein Marvi
Hossein Khosravi
Wenwu Wang
Affiliations
[1] Shahrood University of Technology, Faculty of Electrical Engineering
[2] University of Surrey, Department of Electrical and Electronic Engineering
Source
Signal, Image and Video Processing | 2022, Vol. 16
Keywords
Speech processing; Data augmentation; Speech emotion recognition; Generative adversarial networks
DOI: not available
Abstract
One of the obstacles in developing speech emotion recognition (SER) systems is the data scarcity problem, i.e., the lack of labeled data for training these systems. Data augmentation is an effective method for increasing the amount of training data. In this paper, we propose a cycle-generative adversarial network (cycle-GAN) for data augmentation in SER systems. For each of the five emotions considered, an adversarial network is designed to generate data whose distribution is similar to that of the real data in that class but different from those of the other classes. These networks are trained adversarially to produce feature vectors similar to those in the training set, which are then added to the original training sets. Instead of the common cross-entropy loss for training cycle-GANs, we use the Wasserstein divergence to mitigate the gradient vanishing problem and to generate high-quality samples. The proposed network has been applied to SER using the EMO-DB dataset. The quality of the generated data is evaluated using two classifiers based on a support vector machine and a deep neural network. The results showed that the recognition accuracy in unweighted average recall was about 83.33%, which is higher than that of the compared baseline methods.
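The augmentation idea summarized above can be illustrated with a minimal sketch. The code below is a hypothetical, plain Wasserstein-GAN-style generator/critic pair trained on per-class feature vectors; it is not the authors' implementation: the cycle-consistency terms and the Wasserstein-divergence formulation from the paper are omitted, and all sizes and hyper-parameters (FEAT_DIM, NOISE_DIM, HIDDEN, learning rates, weight clipping) are assumptions.

```python
# Hypothetical sketch: WGAN-style augmentation of emotion feature vectors.
# Not the paper's cycle-GAN / Wasserstein-divergence code; all sizes and
# hyper-parameters below are assumptions for illustration only.
import torch
import torch.nn as nn

FEAT_DIM, NOISE_DIM, HIDDEN = 384, 64, 256  # assumed dimensions

generator = nn.Sequential(              # maps noise to a synthetic feature vector
    nn.Linear(NOISE_DIM, HIDDEN), nn.ReLU(),
    nn.Linear(HIDDEN, FEAT_DIM),
)
critic = nn.Sequential(                 # scores real vs. generated feature vectors
    nn.Linear(FEAT_DIM, HIDDEN), nn.ReLU(),
    nn.Linear(HIDDEN, 1),
)

g_opt = torch.optim.RMSprop(generator.parameters(), lr=5e-5)
c_opt = torch.optim.RMSprop(critic.parameters(), lr=5e-5)

def train_step(real_feats: torch.Tensor, n_critic: int = 5, clip: float = 0.01):
    """One WGAN update: several critic steps, then one generator step."""
    batch = real_feats.size(0)
    for _ in range(n_critic):
        fake = generator(torch.randn(batch, NOISE_DIM)).detach()
        # Wasserstein critic loss: minimize E[D(fake)] - E[D(real)]
        c_loss = critic(fake).mean() - critic(real_feats).mean()
        c_opt.zero_grad(); c_loss.backward(); c_opt.step()
        for p in critic.parameters():   # weight clipping keeps the critic Lipschitz
            p.data.clamp_(-clip, clip)
    # Generator tries to raise the critic's score on generated features
    g_loss = -critic(generator(torch.randn(batch, NOISE_DIM))).mean()
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return c_loss.item(), g_loss.item()

# Usage sketch: train one such pair per emotion class on that class's real
# feature vectors, then sample generator(torch.randn(k, NOISE_DIM)) and add
# the synthetic vectors to that class's training set before fitting the classifier.
```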
Pages: 1955-1962
Number of pages: 7