SPEAKER NORMALIZATION FOR SELF-SUPERVISED SPEECH EMOTION RECOGNITION

Cited by: 39
Authors
Gat, Itai [1 ]
Aronowitz, Hagai [1 ]
Zhu, Weizhong [1 ]
Morais, Edmilson [1 ]
Hoory, Ron [1 ]
Affiliations
[1] IBM Res AI, Albany, NY 12203 USA
Source
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2022
Keywords
Speech emotion recognition; speaker normalization; self-supervised learning;
DOI
10.1109/ICASSP43922.2022.9747460
Chinese Library Classification
O42 [Acoustics]
Discipline Codes
070206; 082403
Abstract
Large speech emotion recognition datasets are hard to obtain, and small datasets may contain biases. Deep-net-based classifiers, in turn, are prone to exploit those biases and find shortcuts such as speaker characteristics. These shortcuts usually harm a model's ability to generalize. To address this challenge, we propose a gradient-based adversarial learning framework that learns a speech emotion recognition task while normalizing speaker characteristics out of the feature representation. We demonstrate the efficacy of our method in both speaker-independent and speaker-dependent settings and obtain new state-of-the-art results on the challenging IEMOCAP dataset.
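The abstract's "gradient-based adversarial learning" is in the family of gradient-reversal training (Ganin, ref [7] below): a speaker-ID head is trained on the shared representation, but its gradient is sign-flipped before reaching the encoder, so the encoder is pushed to discard speaker cues while still serving the emotion classifier. The sketch below illustrates only that sign-flip arithmetic on toy NumPy data; the shapes, the scalar `lambd`, and the precomputed per-head gradients are illustrative assumptions, not the paper's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 16 utterances with 8-dim features (stand-ins for
# self-supervised speech embeddings) and a shared linear encoder.
x = rng.normal(size=(16, 8))
W_enc = 0.1 * rng.normal(size=(8, 8))

# Hypothetical gradients of the emotion loss and the speaker-ID loss
# w.r.t. the shared representation; in a real model these would come from
# backpropagating each head's cross-entropy loss.
g_emotion = rng.normal(size=(16, 8))
g_speaker = rng.normal(size=(16, 8))

lambd = 0.5  # reversal strength (a tunable hyperparameter)

# Gradient reversal: the speaker branch's gradient is sign-flipped before
# it reaches the encoder, so the encoder learns to remove speaker
# information while keeping emotion-relevant information.
g_shared = g_emotion - lambd * g_speaker

# One plain SGD step on the encoder with the combined gradient.
lr = 1e-2
W_enc -= lr * (x.T @ g_shared) / len(x)
print(W_enc.shape)  # (8, 8)
```

The speaker head itself is still trained normally to predict speaker identity; only the gradient flowing past the reversal point into the encoder is negated.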
Pages: 7342-7346
Page count: 5
References
31 items in total
[1] [Anonymous], 2016, ICASSP.
[2] [Anonymous], IEEE SIGNAL PROCESSI
[3] Baevski, A., et al. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. Advances in Neural Information Processing Systems, 2020.
[4] Busso, C., Bulut, M., Lee, C.-C., Kazemzadeh, A., Mower, E., Kim, S., Chang, J. N., Lee, S., Narayanan, S. S. IEMOCAP: interactive emotional dyadic motion capture database. Language Resources and Evaluation, 2008, 42(4): 335-359.
[5] Chung, Y.-A., Tang, H., Glass, J. Vector-Quantized Autoregressive Predictive Coding. INTERSPEECH 2020: 3760-3764.
[6] Chung, Y.-A., Hsu, W.-N., Tang, H., Glass, J. An Unsupervised Autoregressive Model for Speech Representation Learning. INTERSPEECH 2019: 146-150.
[7] Ganin, Y., Lempitsky, V. Unsupervised Domain Adaptation by Backpropagation. ICML 2015. arXiv:1409.7495.
[8] Hsu, W.-N., et al., 2021, arXiv:2106.07447.
[9] Jalal, A., 2020, INTERSPEECH.
[10] Keren, G., 2016, IJCNN, p. 2.