SPEAKER NORMALIZATION FOR SELF-SUPERVISED SPEECH EMOTION RECOGNITION

Cited by: 24
Authors
Gat, Itai [1]
Aronowitz, Hagai [1]
Zhu, Weizhong [1]
Morais, Edmilson [1]
Hoory, Ron [1]
Affiliations
[1] IBM Res AI, Albany, NY 12203 USA
Source
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2022
Keywords
Speech emotion recognition; speaker normalization; self-supervised learning
DOI
10.1109/ICASSP43922.2022.9747460
Chinese Library Classification number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Large speech emotion recognition datasets are hard to obtain, and small datasets may contain biases. Deep-net-based classifiers, in turn, are prone to exploit those biases and find shortcuts such as speaker characteristics. These shortcuts usually harm a model's ability to generalize. To address this challenge, we propose a gradient-based adversary learning framework that learns a speech emotion recognition task while normalizing speaker characteristics from the feature representation. We demonstrate the efficacy of our method on both speaker-independent and speaker-dependent settings and obtain new state-of-the-art results on the challenging IEMOCAP dataset.
Pages: 7342-7346
Page count: 5