SPEAKER NORMALIZATION FOR SELF-SUPERVISED SPEECH EMOTION RECOGNITION

Cited by: 39
Authors
Gat, Itai [1]
Aronowitz, Hagai [1]
Zhu, Weizhong [1]
Morais, Edmilson [1]
Hoory, Ron [1]
Affiliation
[1] IBM Res AI, Albany, NY 12203 USA
Source
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2022
Keywords
Speech emotion recognition; speaker normalization; self-supervised learning;
DOI
10.1109/ICASSP43922.2022.9747460
Chinese Library Classification number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
Large speech emotion recognition datasets are hard to obtain, and small datasets may contain biases. Deep-net-based classifiers, in turn, are prone to exploit those biases and find shortcuts such as speaker characteristics. These shortcuts usually harm a model's ability to generalize. To address this challenge, we propose a gradient-based adversary learning framework that learns a speech emotion recognition task while normalizing speaker characteristics from the feature representation. We demonstrate the efficacy of our method on both speaker-independent and speaker-dependent settings and obtain new state-of-the-art results on the challenging IEMOCAP dataset.
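The abstract does not say how the adversarial objective is implemented. One common way to realize this kind of speaker normalization is gradient reversal, shown below as a minimal sketch on a scalar toy model. Everything in it is an assumption made for illustration and is not taken from the paper: the function name `adversarial_step`, the weights `w`, `a`, `b`, the squared-error losses, and the trade-off factor `lam`.

```python
def adversarial_step(w, a, b, x, y_emo, y_spk, lr=0.1, lam=1.0):
    """One SGD step on a hypothetical scalar model with a reversed speaker gradient.

    Encoder:       z = w * x
    Emotion head:  a * z  (trained to predict y_emo)
    Speaker head:  b * z  (trained to predict y_spk)
    Both losses are squared errors. This is an illustrative toy, not the paper's model.
    """
    z = w * x
    err_emo = a * z - y_emo  # emotion residual
    err_spk = b * z - y_spk  # speaker residual

    # Each head minimizes its own loss in the usual way.
    grad_a = 2.0 * err_emo * z
    grad_b = 2.0 * err_spk * z

    # Gradients of each loss with respect to the shared encoder weight.
    grad_w_emo = 2.0 * err_emo * a * x
    grad_w_spk = 2.0 * err_spk * b * x

    # Gradient reversal: the encoder descends the emotion loss but
    # ascends the speaker loss, scaled by lam.
    grad_w = grad_w_emo - lam * grad_w_spk

    return w - lr * grad_w, a - lr * grad_a, b - lr * grad_b


# Usage: the speaker head still learns to predict the speaker target,
# while the encoder is pushed in the opposite direction.
new_w, new_a, new_b = adversarial_step(1.0, 1.0, 1.0, x=1.0, y_emo=1.0, y_spk=0.0)
```

With `lam=0.0` the speaker term has no effect on the encoder and the step reduces to ordinary training of the emotion task; larger values of `lam` push the encoder harder toward features from which the speaker head cannot recover the speaker target.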
Pages: 7342-7346
Page count: 5