MTLSER: Multi-task learning enhanced speech emotion recognition with pre-trained acoustic model

Cited by: 0
Authors
Chen, Zengzhao [1 ,2 ]
Liu, Chuan [1 ]
Wang, Zhifeng [1 ]
Zhao, Chuanxu [1 ]
Lin, Mengting [1 ]
Zheng, Qiuyu [1 ]
Affiliations
[1] Cent China Normal Univ, Fac Artificial Intelligence Educ, Wuhan 430079, Peoples R China
[2] Natl Intelligent Soc Governance Expt Base Educ, Wuhan 430079, Peoples R China
Funding
National Natural Science Foundation of China;
关键词
Multi-task learning; Speech emotion recognition; Speaker identification; Automatic speech recognition; Speech representation learning;
DOI
10.1016/j.eswa.2025.126855
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Code
081104; 0812; 0835; 1405;
Abstract
This study proposes a novel Speech Emotion Recognition (SER) approach employing a Multi-Task Learning framework (MTLSER), designed to boost recognition accuracy by training multiple related tasks simultaneously and sharing information via a joint loss function. This framework integrates SER as the primary task, with Automatic Speech Recognition (ASR) and speaker identification serving as auxiliary tasks. Feature extraction is conducted using the pre-trained wav2vec2.0 model, which acts as a shared layer within our multi-task learning (MTL) framework. Extracted features are then processed in parallel by the three tasks. The contributions of auxiliary tasks are adjusted through hyperparameters, and their loss functions are amalgamated into a singular joint loss function for effective backpropagation. This optimization refines the model's internal parameters. Our method's efficacy is tested during the inference stage, where the model concurrently outputs the emotion, textual content, and speaker identity from the input audio. We conducted ablation studies and a sensitivity analysis on the hyperparameters to determine the optimal settings for emotion recognition. The performance of our proposed MTLSER model is evaluated using the public IEMOCAP dataset. Results from extensive testing show a significant improvement over traditional methods, achieving a Weighted Accuracy (WA) of 82.63% and an Unweighted Accuracy (UA) of 82.19%. These findings affirm the effectiveness and robustness of our approach. Our code is publicly available at https://github.com/CCNU-nercel-lc/MTL-SER.
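The abstract describes features from a shared pre-trained encoder being fed to three task heads, with the auxiliary-task losses scaled by hyperparameters and summed into one joint loss. A minimal NumPy sketch of that loss combination is below; the head weights, dimensions, class labels, and the hyperparameter values are illustrative stand-ins (the actual model uses wav2vec2.0 features and task-specific losses such as CTC for ASR), not the paper's implementation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(logits, label):
    # Negative log-likelihood of the true class.
    return -np.log(softmax(logits)[label])

rng = np.random.default_rng(0)

# Stand-in for the shared wav2vec2.0 representation of one utterance.
shared_feat = rng.standard_normal(8)

# One (untrained, placeholder) linear head per task.
W_ser = rng.standard_normal((4, 8))   # 4 emotion classes (primary task)
W_asr = rng.standard_normal((30, 8))  # toy token vocabulary (auxiliary)
W_sid = rng.standard_normal((10, 8))  # 10 speakers (auxiliary)

loss_ser = cross_entropy(W_ser @ shared_feat, label=1)
loss_asr = cross_entropy(W_asr @ shared_feat, label=5)
loss_sid = cross_entropy(W_sid @ shared_feat, label=3)

# Auxiliary contributions are scaled by hyperparameters, then all three
# losses are merged into a single joint loss for backpropagation.
alpha, beta = 0.1, 0.1
joint = loss_ser + alpha * loss_asr + beta * loss_sid
```

Because the auxiliary weights only scale strictly positive cross-entropy terms, the joint loss always upper-bounds the primary SER loss; tuning `alpha` and `beta` (as in the paper's sensitivity analysis) controls how strongly the auxiliary tasks shape the shared representation.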
Pages: 16
Related Papers
50 records in total
  • [21] Speech Emotion: Investigating Model Representations, Multi-Task Learning and Knowledge Distillation
    Mitra, Vikramjit
    Chien, Hsiang-Yun Sherry
    Kowtha, Vasudha
    Cheng, Joseph Yitan
    Azemi, Erdrin
    INTERSPEECH 2022, 2022, : 4715 - 4719
  • [22] Multi-Task Semi-Supervised Adversarial Autoencoding for Speech Emotion Recognition
    Latif, Siddique
    Rana, Rajib
    Khalifa, Sara
    Jurdak, Raja
    Epps, Julien
    Schuller, Bjoern W.
    IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, 2022, 13 (02) : 992 - 1004
  • [23] Towards multi-task learning of speech and speaker recognition
    Vaessen, Nik
    van Leeuwen, David A.
    INTERSPEECH 2023, 2023, : 4898 - 4902
  • [24] EmoComicNet: A multi-task model for comic emotion recognition
    Dutta, Arpita
    Biswas, Samit
    Das, Amit Kumar
    PATTERN RECOGNITION, 2024, 150
  • [25] I-VECTOR ESTIMATION AS AUXILIARY TASK FOR MULTI-TASK LEARNING BASED ACOUSTIC MODELING FOR AUTOMATIC SPEECH RECOGNITION
    Pironkov, Gueorgui
    Dupont, Stephane
    Dutoit, Thierry
    2016 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2016), 2016, : 1 - 7
  • [26] Automatic Speech Recognition Dataset Augmentation with Pre-Trained Model and Script
    Kwon, Minsu
    Choi, Ho-Jin
    2019 IEEE INTERNATIONAL CONFERENCE ON BIG DATA AND SMART COMPUTING (BIGCOMP), 2019, : 649 - 651
  • [27] Multi-Task Conformer with Multi-Feature Combination for Speech Emotion Recognition
    Seo, Jiyoung
    Lee, Bowon
    SYMMETRY-BASEL, 2022, 14 (07):
  • [28] Towards Speech Emotion Recognition "in the wild" using Aggregated Corpora and Deep Multi-Task Learning
    Kim, Jaebok
    Englebienne, Gwenn
    Truong, Khiet P.
    Evers, Vanessa
    18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 1113 - 1117
  • [29] MASTER: Multi-task Pre-trained Bottlenecked Masked Autoencoders Are Better Dense Retrievers
    Zhou, Kun
    Liu, Xiao
    Gong, Yeyun
    Zhao, Wayne Xin
    Jiang, Daxin
    Duan, Nan
    Wen, Ji-Rong
    MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES: RESEARCH TRACK, ECML PKDD 2023, PT II, 2023, 14170 : 630 - 647
  • [30] Poster Abstract: Speech Emotion Recognition via Attention-based DNN from Multi-Task Learning
    Ma, Fei
    Gu, Weixi
    Zhang, Wei
    Ni, Shiguang
    Huang, Shao-Lun
    Zhang, Lin
    SENSYS'18: PROCEEDINGS OF THE 16TH CONFERENCE ON EMBEDDED NETWORKED SENSOR SYSTEMS, 2018, : 363 - 364