MTLSER: Multi-task learning enhanced speech emotion recognition with pre-trained acoustic model

Cited: 0
Authors
Chen, Zengzhao [1 ,2 ]
Liu, Chuan [1 ]
Wang, Zhifeng [1 ]
Zhao, Chuanxu [1 ]
Lin, Mengting [1 ]
Zheng, Qiuyu [1 ]
Affiliations
[1] Cent China Normal Univ, Fac Artificial Intelligence Educ, Wuhan 430079, Peoples R China
[2] Natl Intelligent Soc Governance Expt Base Educ, Wuhan 430079, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Multi-task learning; Speech emotion recognition; Speaker identification; Automatic speech recognition; Speech representation learning;
DOI
10.1016/j.eswa.2025.126855
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
This study proposes a novel Speech Emotion Recognition (SER) approach employing a Multi-Task Learning framework (MTLSER), designed to boost recognition accuracy by training multiple related tasks simultaneously and sharing information via a joint loss function. This framework integrates SER as the primary task, with Automatic Speech Recognition (ASR) and speaker identification serving as auxiliary tasks. Feature extraction is conducted using the pre-trained wav2vec2.0 model, which acts as a shared layer within our multi-task learning (MTL) framework. Extracted features are then processed in parallel by the three tasks. The contributions of auxiliary tasks are adjusted through hyperparameters, and their loss functions are amalgamated into a singular joint loss function for effective backpropagation. This optimization refines the model's internal parameters. Our method's efficacy is tested during the inference stage, where the model concurrently outputs the emotion, textual content, and speaker identity from the input audio. We conducted ablation studies and a sensitivity analysis on the hyperparameters to determine the optimal settings for emotion recognition. The performance of our proposed MTLSER model is evaluated using the public IEMOCAP dataset. Results from extensive testing show a significant improvement over traditional methods, achieving a Weighted Accuracy (WA) of 82.63% and an Unweighted Accuracy (UA) of 82.19%. These findings affirm the effectiveness and robustness of our approach. Our code is publicly available at https://github.com/CCNU-nercel-lc/MTL-SER.
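The abstract describes combining the primary SER loss with the two auxiliary losses (ASR and speaker identification) through hyperparameter weights into a single joint loss for backpropagation. A minimal sketch of that weighting scheme is below; the weight names `alpha` and `beta` and the function name are illustrative assumptions, not the paper's actual notation.

```python
def joint_loss(l_ser: float, l_asr: float, l_sid: float,
               alpha: float = 0.1, beta: float = 0.1) -> float:
    """Combine the primary SER loss with weighted auxiliary losses.

    l_ser: loss of the primary speech emotion recognition task
    l_asr: loss of the auxiliary automatic speech recognition task
    l_sid: loss of the auxiliary speaker identification task
    alpha, beta: hyperparameters scaling each auxiliary task's
    contribution (values here are placeholders, tuned in practice
    via the paper's sensitivity analysis)
    """
    return l_ser + alpha * l_asr + beta * l_sid
```

In a typical MTL training loop, all three task heads would share the wav2vec2.0 encoder's features, and this single scalar would be backpropagated so the shared parameters receive gradients from every task.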
Pages: 16