Cross Corpus Speech Emotion Recognition using transfer learning and attention-based fusion of Wav2Vec2 and prosody features

Cited by: 8
Authors
Naderi, Navid [1 ]
Nasersharif, Babak [1 ]
Affiliations
[1] KN Toosi Univ Technol, Dept Comp Engn, Shariati St, Tehran, Iran
Keywords
Cross-corpus speech emotion recognition; Transfer learning; Domain adaptation; Attention; Feature fusion; Wav2Vec2
DOI
10.1016/j.knosys.2023.110814
CLC number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Speech Emotion Recognition (SER) performance degrades when training and test conditions or corpora differ. Cross-corpus SER (CCSER) is a research branch that addresses adapting an SER system to identify speech emotions on a corpus whose recording conditions or language differ from those of the training corpus. For CCSER, adaptation can be performed in the feature extraction module or in the emotion classifier, the two main components of an SER system. In this paper, we propose the AFTL method (attention-based feature fusion along with transfer learning), which includes methods for both feature extraction and classification in CCSER. In the feature extraction part, we use Wav2Vec 2.0 transformer blocks and prosody features, and we propose an attention method for fusing them. In the classifier part, we use transfer learning to transfer the knowledge of a model trained on a source emotional speech corpus to recognize emotions on a target corpus. We performed experiments on numerous emotional speech datasets as target corpora, using IEMOCAP as the source corpus. For instance, we achieve 92.45% accuracy on the EmoDB dataset while using only 20% of its speakers to adapt the source model. For the other target corpora, we also obtained admissible results. © 2023 Elsevier B.V. All rights reserved.
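To make the fusion idea in the abstract concrete, below is a minimal sketch (not the authors' code) of attention-based fusion of Wav2Vec 2.0 features with a prosody vector, assuming the Hugging Face transformers library. The checkpoint name, the prosody dimension, the mean-pooling over time, and the scalar two-stream attention are illustrative assumptions; the paper's actual attention mechanism and layer sizes may differ.

```python
# Hypothetical sketch of Wav2Vec2 + prosody attention fusion for SER.
# Assumptions (not from the paper): "facebook/wav2vec2-base" checkpoint,
# a precomputed 64-dim prosody vector, mean-pooling, scalar stream attention.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class AttentionFusionSER(nn.Module):
    def __init__(self, prosody_dim=64, num_emotions=4):
        super().__init__()
        # Pretrained Wav2Vec 2.0 encoder (CNN front-end + transformer blocks)
        self.wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        hidden = self.wav2vec.config.hidden_size  # 768 for the base model
        # Project prosody features into the Wav2Vec2 embedding space
        self.prosody_proj = nn.Linear(prosody_dim, hidden)
        # One scalar attention weight per feature stream
        self.attn = nn.Linear(hidden, 1)
        self.classifier = nn.Linear(hidden, num_emotions)

    def forward(self, waveform, prosody):
        # waveform: (batch, samples) at 16 kHz; prosody: (batch, prosody_dim)
        w2v = self.wav2vec(waveform).last_hidden_state.mean(dim=1)  # (batch, hidden)
        pros = self.prosody_proj(prosody)                           # (batch, hidden)
        streams = torch.stack([w2v, pros], dim=1)                   # (batch, 2, hidden)
        weights = torch.softmax(self.attn(streams), dim=1)          # (batch, 2, 1)
        fused = (weights * streams).sum(dim=1)                      # (batch, hidden)
        return self.classifier(fused)
```

For the transfer-learning step, one plausible reading of the abstract is to train such a model on the source corpus (IEMOCAP) and then fine-tune it on the small adaptation subset of the target corpus (e.g., 20% of EmoDB speakers), possibly freezing the Wav2Vec2 encoder; the record does not specify which layers the authors freeze.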
Pages: 11