Cross Corpus Speech Emotion Recognition using transfer learning and attention-based fusion of Wav2Vec2 and prosody features

Cited by: 8
Authors
Naderi, Navid [1 ]
Nasersharif, Babak [1 ]
Affiliations
[1] KN Toosi Univ Technol, Dept Comp Engn, Shariati St, Tehran, Iran
Keywords
Cross-corpus speech emotion recognition; Transfer learning; Domain adaptation; Attention; Feature fusion; Wav2Vec2
DOI
10.1016/j.knosys.2023.110814
CLC number
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Speech Emotion Recognition (SER) performance degrades when the training and test conditions or corpora differ. Cross-corpus SER (CCSER) is a research branch that addresses adapting an SER system to identify speech emotions on a corpus whose recording conditions or language differ from those of the training corpus. For CCSER, adaptation can be performed in the feature extraction module or in the emotion classifier, the two main components of an SER system. In this paper, we propose the AFTL method (attention-based feature fusion along with transfer learning), which includes methods for both feature extraction and classification in CCSER. In the feature extraction part, we use Wav2Vec 2.0 transformer blocks and prosody features, and we propose an attention method for fusing them. In the classifier part, we use transfer learning to transfer the knowledge of a model trained on a source emotional speech corpus to recognize emotions on a target corpus. We performed experiments on numerous emotional speech datasets as target corpora, with IEMOCAP as the source corpus. For instance, we achieve 92.45% accuracy on the EmoDB dataset while using only 20% of its speakers to adapt the source model. For the other target corpora, we also obtained admissible results. © 2023 Elsevier B.V. All rights reserved.
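The abstract describes the attention-based fusion only at a high level. The following is a minimal, illustrative PyTorch sketch, not the authors' implementation: it assumes each Wav2Vec 2.0 transformer block's output is mean-pooled over time into one vector per block, that an utterance-level prosody descriptor vector is available, and that a learned scalar attention score weights these streams before summing. The module name `AttentionFusion` and all dimensions are hypothetical.

```python
# Illustrative sketch of attention-based fusion of Wav2Vec2 block outputs
# with a prosody feature vector. Not the paper's exact architecture.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, wav2vec_dim=768, prosody_dim=88, fused_dim=256):
        super().__init__()
        # Project each feature stream into a shared space.
        self.wav_proj = nn.Linear(wav2vec_dim, fused_dim)
        self.pros_proj = nn.Linear(prosody_dim, fused_dim)
        # One scalar attention score per stream (each block + prosody).
        self.score = nn.Linear(fused_dim, 1)

    def forward(self, block_feats, prosody):
        # block_feats: (batch, num_blocks, wav2vec_dim), the time-pooled
        #              output of each Wav2Vec2 transformer block.
        # prosody:     (batch, prosody_dim), e.g. eGeMAPS-style descriptors.
        streams = torch.cat(
            [self.wav_proj(block_feats),
             self.pros_proj(prosody).unsqueeze(1)],
            dim=1,
        )  # (batch, num_blocks + 1, fused_dim)
        weights = torch.softmax(self.score(torch.tanh(streams)), dim=1)
        return (weights * streams).sum(dim=1)  # (batch, fused_dim)

# Usage: fuse 12 pooled block outputs with an 88-dim prosody vector.
fusion = AttentionFusion()
fused = fusion(torch.randn(4, 12, 768), torch.randn(4, 88))
print(fused.shape)  # torch.Size([4, 256])
```

A classifier trained on the source corpus (IEMOCAP) on top of such a fused representation could then be fine-tuned on a small fraction of target-corpus speakers, which corresponds to the transfer-learning step the abstract refers to.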
Pages: 11