Multi-task coordinate attention gating network for speech emotion recognition under noisy circumstances

Cited by: 0
Authors
Sun, Linhui [1 ]
Lei, Yunlong [1 ]
Zhang, Zixiao [1 ]
Tang, Yi [1 ]
Wang, Jing [1 ]
Ye, Lei [1 ]
Li, Pingan [1 ]
Affiliations
[1] Nanjing Univ Posts & Telecommun, Coll Telecommun & Informat Engn, Nanjing, Peoples R China
Keywords
Speech emotion recognition; Multi-task learning; Noisy environment; Attention mechanism;
DOI
10.1016/j.bspc.2025.107811
Chinese Library Classification
R318 [Biomedical Engineering]
Discipline Code
0831
Abstract
Speech emotion recognition (SER) has recently made great progress in ideal environments, but its performance deteriorates dramatically in complex real-world environments, mainly due to poor model robustness and generalization. To this end, we propose a multi-task coordinate attention gated network (MTCAGN) framework. For the SER main task, we propose a multi-scale gated convolutional neural network with a coordinate attention mechanism, which captures a wide range of emotional features at different scales along with key global information, accurately focusing on salient emotional features in speech signals. Speech enhancement is used as an auxiliary task during the training phase, and shared representation learning strengthens the overall robustness of the system, allowing it to withstand complex interference in noisy scenarios. In the inference phase, the speech enhancement branch is removed and only the SER task is retained, so the proposed method improves the robustness of the SER system without increasing inference complexity. To simulate noisy scenarios, we construct three noisy speech datasets by randomly mixing clean audio from the IEMOCAP or EMODB datasets with noise from the MUSAN dataset. Empirical results show that the proposed model outperforms present state-of-the-art techniques in challenging low signal-to-noise-ratio environments, as measured by weighted and unweighted accuracy.
Pages: 10