Improving Speech Emotion Recognition With Adversarial Data Augmentation Network

Cited by: 72
Authors
Yi, Lu [1]
Mak, Man-Wai [1]
Affiliations
[1] The Hong Kong Polytechnic University, Department of Electronic and Information Engineering, Hong Kong, People's Republic of China
Keywords
Generators; Feature extraction; Training; Emotion recognition; Speech recognition; Generative adversarial networks; Data augmentation; generative adversarial networks (GANs); speech emotion recognition; Wasserstein divergence; NEURAL-NETWORKS; MODEL
DOI
10.1109/TNNLS.2020.3027600
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
When training data are scarce, it is difficult to train a deep neural network without overfitting. To overcome this challenge, this article proposes a new data augmentation network, the adversarial data augmentation network (ADAN), based on generative adversarial networks (GANs). The ADAN consists of a GAN, an autoencoder, and an auxiliary classifier. These networks are trained adversarially to synthesize class-dependent feature vectors in both the latent space and the original feature space, which can be added to the real training data for training classifiers. Instead of the conventional cross-entropy loss, the Wasserstein divergence is used for adversarial training in an attempt to produce high-quality synthetic samples. The proposed networks were applied to speech emotion recognition, with EmoDB and IEMOCAP as the evaluation data sets. It was found that forcing the synthetic latent vectors and the real latent vectors to share a common representation largely alleviates the gradient vanishing problem. The results also show that the augmented data generated by the proposed networks are rich in emotion information. Thus, the resulting emotion classifiers are competitive with state-of-the-art speech emotion recognition systems.
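The abstract names the Wasserstein divergence but does not state it. For orientation only, the form commonly given in the WGAN-div literature is sketched below. This is an assumption about the general definition, not the exact ADAN training objective, which this record does not specify. The symbols are illustrative: f is the critic (discriminator), P_r and P_g are the real and generated distributions, P_u is a reference distribution (in practice often samples interpolated between real and generated points), and k > 0 and p > 1 are hyperparameters.

```latex
% Wasserstein divergence between P_r and P_g, as commonly stated (assumed form;
% not taken from this record). f ranges over smooth critic functions.
W_{k,p}(P_r, P_g) \;=\; \max_{f}\;
  \mathbb{E}_{x \sim P_r}\bigl[f(x)\bigr]
  \;-\; \mathbb{E}_{\tilde{x} \sim P_g}\bigl[f(\tilde{x})\bigr]
  \;-\; k\, \mathbb{E}_{\hat{x} \sim P_u}\bigl[\lVert \nabla_{\hat{x}} f(\hat{x}) \rVert^{p}\bigr]

% Equivalent critic loss to minimize during adversarial training:
\mathcal{L}_{D} \;=\;
  -\,\mathbb{E}_{x \sim P_r}\bigl[f(x)\bigr]
  \;+\; \mathbb{E}_{\tilde{x} \sim P_g}\bigl[f(\tilde{x})\bigr]
  \;+\; k\, \mathbb{E}_{\hat{x} \sim P_u}\bigl[\lVert \nabla_{\hat{x}} f(\hat{x}) \rVert^{p}\bigr]
```

Unlike a gradient penalty that pulls the gradient norm toward 1, the last term penalizes the gradient norm directly, so the critic is not required to satisfy a Lipschitz constraint. How ADAN combines this term with the autoencoder and auxiliary-classifier losses is described in the article itself, not in this abstract.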
Pages: 172-184
Number of pages: 13