MTLSER: Multi-task learning enhanced speech emotion recognition with pre-trained acoustic model

Cited by: 0
Authors
Chen, Zengzhao [1 ,2 ]
Liu, Chuan [1 ]
Wang, Zhifeng [1 ]
Zhao, Chuanxu [1 ]
Lin, Mengting [1 ]
Zheng, Qiuyu [1 ]
Affiliations
[1] Cent China Normal Univ, Fac Artificial Intelligence Educ, Wuhan 430079, Peoples R China
[2] Natl Intelligent Soc Governance Expt Base Educ, Wuhan 430079, Peoples R China
Funding
National Natural Science Foundation of China;
关键词
Multi-task learning; Speech emotion recognition; Speaker identification; Automatic speech recognition; Speech representation learning;
DOI
10.1016/j.eswa.2025.126855
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Code
081104; 0812; 0835; 1405;
Abstract
This study proposes a novel Speech Emotion Recognition (SER) approach employing a Multi-Task Learning framework (MTLSER), designed to boost recognition accuracy by training multiple related tasks simultaneously and sharing information via a joint loss function. This framework integrates SER as the primary task, with Automatic Speech Recognition (ASR) and speaker identification serving as auxiliary tasks. Feature extraction is conducted using the pre-trained wav2vec2.0 model, which acts as a shared layer within our multi-task learning (MTL) framework. Extracted features are then processed in parallel by the three tasks. The contributions of auxiliary tasks are adjusted through hyperparameters, and their loss functions are amalgamated into a singular joint loss function for effective backpropagation. This optimization refines the model's internal parameters. Our method's efficacy is tested during the inference stage, where the model concurrently outputs the emotion, textual content, and speaker identity from the input audio. We conducted ablation studies and a sensitivity analysis on the hyperparameters to determine the optimal settings for emotion recognition. The performance of our proposed MTLSER model is evaluated using the public IEMOCAP dataset. Results from extensive testing show a significant improvement over traditional methods, achieving a Weighted Accuracy (WA) of 82.63% and an Unweighted Accuracy (UA) of 82.19%. These findings affirm the effectiveness and robustness of our approach. Our code is publicly available at https://github.com/CCNU-nercel-lc/MTL-SER.
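The abstract describes features from a shared pre-trained encoder being fed to three task heads, with the auxiliary-task losses scaled by hyperparameters and summed into one joint loss. A minimal NumPy sketch of that loss combination is below; the head weights, dimensions, class labels, and the hyperparameter values are illustrative stand-ins (the actual model uses wav2vec2.0 features and task-specific losses such as CTC for ASR), not the paper's implementation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(logits, label):
    # Negative log-likelihood of the true class.
    return -np.log(softmax(logits)[label])

rng = np.random.default_rng(0)

# Stand-in for the shared wav2vec2.0 representation of one utterance.
shared_feat = rng.standard_normal(8)

# One (untrained, placeholder) linear head per task.
W_ser = rng.standard_normal((4, 8))   # 4 emotion classes (primary task)
W_asr = rng.standard_normal((30, 8))  # toy token vocabulary (auxiliary)
W_sid = rng.standard_normal((10, 8))  # 10 speakers (auxiliary)

loss_ser = cross_entropy(W_ser @ shared_feat, label=1)
loss_asr = cross_entropy(W_asr @ shared_feat, label=5)
loss_sid = cross_entropy(W_sid @ shared_feat, label=3)

# Auxiliary contributions are scaled by hyperparameters, then all three
# losses are merged into a single joint loss for backpropagation.
alpha, beta = 0.1, 0.1
joint = loss_ser + alpha * loss_asr + beta * loss_sid
```

Because the auxiliary weights only scale strictly positive cross-entropy terms, the joint loss always upper-bounds the primary SER loss; tuning `alpha` and `beta` (as in the paper's sensitivity analysis) controls how strongly the auxiliary tasks shape the shared representation.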
Pages: 16
Related Papers
50 records in total
  • [21] Speech Emotion: Investigating Model Representations, Multi-Task Learning and Knowledge Distillation
    Mitra, Vikramjit
    Chien, Hsiang-Yun Sherry
    Kowtha, Vasudha
    Cheng, Joseph Yitan
    Azemi, Erdrin
    INTERSPEECH 2022, 2022, : 4715 - 4719
  • [22] Multi-Task Semi-Supervised Adversarial Autoencoding for Speech Emotion Recognition
    Latif, Siddique
    Rana, Rajib
    Khalifa, Sara
    Jurdak, Raja
    Epps, Julien
    Schuller, Bjoern W.
    IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, 2022, 13 (02) : 992 - 1004
  • [23] Towards multi-task learning of speech and speaker recognition
    Vaessen, Nik
    van Leeuwen, David A.
    INTERSPEECH 2023, 2023, : 4898 - 4902
  • [24] EmoComicNet: A multi-task model for comic emotion recognition
    Dutta, Arpita
    Biswas, Samit
    Das, Amit Kumar
    PATTERN RECOGNITION, 2024, 150
  • [25] I-VECTOR ESTIMATION AS AUXILIARY TASK FOR MULTI-TASK LEARNING BASED ACOUSTIC MODELING FOR AUTOMATIC SPEECH RECOGNITION
    Pironkov, Gueorgui
    Dupont, Stephane
    Dutoit, Thierry
    2016 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2016), 2016, : 1 - 7
  • [26] Automatic Speech Recognition Dataset Augmentation with Pre-Trained Model and Script
    Kwon, Minsu
    Choi, Ho-Jin
    2019 IEEE INTERNATIONAL CONFERENCE ON BIG DATA AND SMART COMPUTING (BIGCOMP), 2019, : 649 - 651
  • [27] Multi-Task Conformer with Multi-Feature Combination for Speech Emotion Recognition
    Seo, Jiyoung
    Lee, Bowon
    SYMMETRY-BASEL, 2022, 14 (07):
  • [28] Towards Speech Emotion Recognition "in the wild" using Aggregated Corpora and Deep Multi-Task Learning
    Kim, Jaebok
    Englebienne, Gwenn
    Truong, Khiet P.
    Evers, Vanessa
    18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 1113 - 1117
  • [29] MASTER: Multi-task Pre-trained Bottlenecked Masked Autoencoders Are Better Dense Retrievers
    Zhou, Kun
    Liu, Xiao
    Gong, Yeyun
    Zhao, Wayne Xin
    Jiang, Daxin
    Duan, Nan
    Wen, Ji-Rong
    MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES: RESEARCH TRACK, ECML PKDD 2023, PT II, 2023, 14170 : 630 - 647
  • [30] Poster Abstract: Speech Emotion Recognition via Attention-based DNN from Multi-Task Learning
    Ma, Fei
    Gu, Weixi
    Zhang, Wei
    Ni, Shiguang
    Huang, Shao-Lun
    Zhang, Lin
    SENSYS'18: PROCEEDINGS OF THE 16TH CONFERENCE ON EMBEDDED NETWORKED SENSOR SYSTEMS, 2018, : 363 - 364