MTLSER: Multi-task learning enhanced speech emotion recognition with pre-trained acoustic model

Cited: 0
Authors
Chen, Zengzhao [1 ,2 ]
Liu, Chuan [1 ]
Wang, Zhifeng [1 ]
Zhao, Chuanxu [1 ]
Lin, Mengting [1 ]
Zheng, Qiuyu [1 ]
Affiliations
[1] Cent China Normal Univ, Fac Artificial Intelligence Educ, Wuhan 430079, Peoples R China
[2] Natl Intelligent Soc Governance Expt Base Educ, Wuhan 430079, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Multi-task learning; Speech emotion recognition; Speaker identification; Automatic speech recognition; Speech representation learning;
DOI
10.1016/j.eswa.2025.126855
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
This study proposes a novel Speech Emotion Recognition (SER) approach employing a Multi-Task Learning framework (MTLSER), designed to boost recognition accuracy by training multiple related tasks simultaneously and sharing information via a joint loss function. This framework integrates SER as the primary task, with Automatic Speech Recognition (ASR) and speaker identification serving as auxiliary tasks. Feature extraction is conducted using the pre-trained wav2vec2.0 model, which acts as a shared layer within our multi-task learning (MTL) framework. Extracted features are then processed in parallel by the three tasks. The contributions of auxiliary tasks are adjusted through hyperparameters, and their loss functions are amalgamated into a singular joint loss function for effective backpropagation. This optimization refines the model's internal parameters. Our method's efficacy is tested during the inference stage, where the model concurrently outputs the emotion, textual content, and speaker identity from the input audio. We conducted ablation studies and a sensitivity analysis on the hyperparameters to determine the optimal settings for emotion recognition. The performance of our proposed MTLSER model is evaluated using the public IEMOCAP dataset. Results from extensive testing show a significant improvement over traditional methods, achieving a Weighted Accuracy (WA) of 82.63% and an Unweighted Accuracy (UA) of 82.19%. These findings affirm the effectiveness and robustness of our approach. Our code is publicly available at https://github.com/CCNU-nercel-lc/MTL-SER.
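The abstract describes combining the primary SER loss with the two auxiliary losses (ASR and speaker identification) through hyperparameter weights into a single joint loss for backpropagation. A minimal sketch of that weighting scheme is below; the weight names `alpha` and `beta` and the function name are illustrative assumptions, not the paper's actual notation.

```python
def joint_loss(l_ser: float, l_asr: float, l_sid: float,
               alpha: float = 0.1, beta: float = 0.1) -> float:
    """Combine the primary SER loss with weighted auxiliary losses.

    l_ser: loss of the primary speech emotion recognition task
    l_asr: loss of the auxiliary automatic speech recognition task
    l_sid: loss of the auxiliary speaker identification task
    alpha, beta: hyperparameters scaling each auxiliary task's
    contribution (values here are placeholders, tuned in practice
    via the paper's sensitivity analysis)
    """
    return l_ser + alpha * l_asr + beta * l_sid
```

In a typical MTL training loop, all three task heads would share the wav2vec2.0 encoder's features, and this single scalar would be backpropagated so the shared parameters receive gradients from every task.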
Pages: 16