Strong Generalized Speech Emotion Recognition Based on Effective Data Augmentation

Times Cited: 3
Authors
Tao, Huawei [1,2]
Shan, Shuai [1]
Hu, Ziyi [1]
Zhu, Chunhua [1,2]
Ge, Hongyi [1,2]
Affiliations
[1] Henan Univ Technol, Key Lab Food Informat Proc & Control, Minist Educ, Zhengzhou 450001, Peoples R China
[2] Henan Univ Technol, Henan Engn Lab Grain IOT Technol, Zhengzhou, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
speech emotion recognition; data augmentation; multi-channel feature extractor; Wasserstein distance; feature distributions; speaker-invariant emotional representations;
DOI
10.3390/e25010068
CLC Number (Chinese Library Classification)
O4 [Physics];
Subject Classification Code
0702;
Abstract
The scarcity of labeled samples limits the development of speech emotion recognition (SER). Data augmentation is an effective way to address this sparsity, yet augmentation algorithms for SER remain under-studied. In this paper, the effectiveness of classical acoustic data augmentation methods for SER is analyzed, and on that basis a strongly generalized SER model built on effective data augmentation is proposed. The model extracts emotional representations with a multi-channel feature extractor composed of multiple sub-networks: each sub-network is fed a kind of augmented data shown to improve SER performance, and the emotional representation is obtained by weighted fusion of the sub-networks' output feature maps. To make the model robust to unseen speakers, adversarial training is further employed to generalize the emotion representations: a discriminator estimates the Wasserstein distance between the feature distributions of different speakers and forces the feature extractor to learn speaker-invariant emotional representations. Experimental results on the IEMOCAP corpus show that the proposed method outperforms related SER algorithms by 2-9%, confirming its effectiveness.
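To make the architecture described in the abstract concrete, the following is a minimal PyTorch sketch, not the authors' released code: the class names, layer sizes, number of channels, input shapes, and the two-speaker critic loss are all illustrative assumptions.

```python
# Hypothetical sketch of the abstract's pipeline: per-augmentation sub-networks,
# weighted fusion into one emotional representation, and a Wasserstein critic
# used adversarially to suppress speaker information. All names and shapes are
# assumptions for illustration, not the paper's actual implementation.
import torch
import torch.nn as nn


class SubNetwork(nn.Module):
    """One channel: a small CNN over a (batch, 1, mel, time) spectrogram."""
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, feat_dim)

    def forward(self, x):
        h = self.conv(x).flatten(1)          # (batch, 64)
        return self.proj(h)                  # (batch, feat_dim)


class MultiChannelExtractor(nn.Module):
    """Weighted fusion of the sub-networks' outputs, one sub-network per
    augmented view of the utterance."""
    def __init__(self, n_channels: int, feat_dim: int = 128):
        super().__init__()
        self.subnets = nn.ModuleList(SubNetwork(feat_dim) for _ in range(n_channels))
        self.fusion_logits = nn.Parameter(torch.zeros(n_channels))  # learned fusion weights

    def forward(self, views):                # views: list of (batch, 1, mel, time) tensors
        feats = torch.stack([net(v) for net, v in zip(self.subnets, views)], dim=0)
        w = torch.softmax(self.fusion_logits, dim=0).view(-1, 1, 1)
        return (w * feats).sum(dim=0)        # (batch, feat_dim) fused representation


class Critic(nn.Module):
    """Scores feature vectors; the gap in its mean score across speakers
    approximates the Wasserstein distance between their feature distributions."""
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, z):
        return self.net(z)


def wasserstein_estimate(critic, feats_spk_a, feats_spk_b):
    """Critic's estimate of the distance between two speakers' feature
    distributions. The critic is trained to maximize this quantity (with weight
    clipping or a gradient penalty, omitted here); the extractor is trained
    adversarially to minimize it, pushing it toward speaker-invariant features."""
    return critic(feats_spk_a).mean() - critic(feats_spk_b).mean()


# Assumed usage: three augmented views of an 80-band mel spectrogram.
# extractor = MultiChannelExtractor(n_channels=3)
# views = [torch.randn(8, 1, 80, 200) for _ in range(3)]
# z = extractor(views)   # (8, 128) fused emotional representation
```

In this sketch the extractor and critic are updated in alternation, the standard Wasserstein adversarial setup the abstract alludes to; an emotion classification head on the fused representation (omitted above) would supply the main training signal.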
Pages: 16