Strong Generalized Speech Emotion Recognition Based on Effective Data Augmentation

Times Cited: 3
Authors
Tao, Huawei [1,2]
Shan, Shuai [1]
Hu, Ziyi [1]
Zhu, Chunhua [1,2]
Ge, Hongyi [1,2]
Affiliations
[1] Henan Univ Technol, Key Lab Food Informat Proc & Control, Minist Educ, Zhengzhou 450001, Peoples R China
[2] Henan Univ Technol, Henan Engn Lab Grain IOT Technol, Zhengzhou, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
speech emotion recognition; data augmentation; multi-channel feature extractor; Wasserstein distance; feature distributions; speaker-invariant emotional representations;
DOI
10.3390/e25010068
CLC Number (Chinese Library Classification)
O4 [Physics];
Subject Classification Code
0702;
Abstract
The scarcity of labeled samples limits the development of speech emotion recognition (SER). Data augmentation is an effective way to address this sparsity, yet augmentation algorithms for SER remain under-studied. In this paper, the effectiveness of classical acoustic data augmentation methods for SER is analyzed, and on that basis a strongly generalized SER model built on effective data augmentation is proposed. The model extracts emotional representations with a multi-channel feature extractor composed of multiple sub-networks: each sub-network is fed a kind of augmented data shown to improve SER performance, and the emotional representation is obtained by weighted fusion of the sub-networks' output feature maps. To make the model robust to unseen speakers, adversarial training is further employed to generalize the emotion representations: a discriminator estimates the Wasserstein distance between the feature distributions of different speakers and forces the feature extractor to learn speaker-invariant emotional representations. Experimental results on the IEMOCAP corpus show that the proposed method outperforms related SER algorithms by 2-9%, confirming its effectiveness.
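To make the architecture described in the abstract concrete, the following is a minimal PyTorch sketch, not the authors' released code: the class names, layer sizes, number of channels, input shapes, and the two-speaker critic loss are all illustrative assumptions.

```python
# Hypothetical sketch of the abstract's pipeline: per-augmentation sub-networks,
# weighted fusion into one emotional representation, and a Wasserstein critic
# used adversarially to suppress speaker information. All names and shapes are
# assumptions for illustration, not the paper's actual implementation.
import torch
import torch.nn as nn


class SubNetwork(nn.Module):
    """One channel: a small CNN over a (batch, 1, mel, time) spectrogram."""
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, feat_dim)

    def forward(self, x):
        h = self.conv(x).flatten(1)          # (batch, 64)
        return self.proj(h)                  # (batch, feat_dim)


class MultiChannelExtractor(nn.Module):
    """Weighted fusion of the sub-networks' outputs, one sub-network per
    augmented view of the utterance."""
    def __init__(self, n_channels: int, feat_dim: int = 128):
        super().__init__()
        self.subnets = nn.ModuleList(SubNetwork(feat_dim) for _ in range(n_channels))
        self.fusion_logits = nn.Parameter(torch.zeros(n_channels))  # learned fusion weights

    def forward(self, views):                # views: list of (batch, 1, mel, time) tensors
        feats = torch.stack([net(v) for net, v in zip(self.subnets, views)], dim=0)
        w = torch.softmax(self.fusion_logits, dim=0).view(-1, 1, 1)
        return (w * feats).sum(dim=0)        # (batch, feat_dim) fused representation


class Critic(nn.Module):
    """Scores feature vectors; the gap in its mean score across speakers
    approximates the Wasserstein distance between their feature distributions."""
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, z):
        return self.net(z)


def wasserstein_estimate(critic, feats_spk_a, feats_spk_b):
    """Critic's estimate of the distance between two speakers' feature
    distributions. The critic is trained to maximize this quantity (with weight
    clipping or a gradient penalty, omitted here); the extractor is trained
    adversarially to minimize it, pushing it toward speaker-invariant features."""
    return critic(feats_spk_a).mean() - critic(feats_spk_b).mean()


# Assumed usage: three augmented views of an 80-band mel spectrogram.
# extractor = MultiChannelExtractor(n_channels=3)
# views = [torch.randn(8, 1, 80, 200) for _ in range(3)]
# z = extractor(views)   # (8, 128) fused emotional representation
```

In this sketch the extractor and critic are updated in alternation, the standard Wasserstein adversarial setup the abstract alludes to; an emotion classification head on the fused representation (omitted above) would supply the main training signal.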
Pages: 16