TOWARDS IMPROVING SPEECH EMOTION RECOGNITION USING SYNTHETIC DATA AUGMENTATION FROM EMOTION CONVERSION

Cited by: 4
Authors
Ibrahim, Karim M. [1]
Perzol, Antony [1]
Leglaive, Simon [2]
Affiliations
[1] Emobot, Paris, France
[2] IETR UMR CNRS 6164, CentraleSupélec, Rennes, France
Source
2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2024) | 2024
Keywords
speech emotion recognition; synthetic data; data augmentation; speech generation
DOI
10.1109/ICASSP48485.2024.10445740
Chinese Library Classification (CLC) number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
One of the main challenges in speech emotion recognition is the lack of large labelled datasets. The progress in speech synthesis allows us to generate reliable and realistic expressive speech. In this work, we propose using a state-of-the-art end-to-end speech emotion conversion model to generate new synthetic data for training speech emotion recognition models. We first evaluate the quality of the converted speech on new unseen datasets, which proves to be on par with the training data. Then, we study the effect of using the synthesized speech as data augmentation. We show that this approach improves the overall performance of emotion recognition models on two different datasets, IEMOCAP and RAVDESS, both in the cases of speaker dependent and independent emotion recognition using a fine-tuned wav2vec 2.0.
Pages: 10636-10640
Number of pages: 5