Multi-task coordinate attention gating network for speech emotion recognition under noisy circumstances

Cited by: 0
Authors
Sun, Linhui [1 ]
Lei, Yunlong [1 ]
Zhang, Zixiao [1 ]
Tang, Yi [1 ]
Wang, Jing [1 ]
Ye, Lei [1 ]
Li, Pingan [1 ]
Affiliations
[1] Nanjing Univ Posts & Telecommun, Coll Telecommun & Informat Engn, Nanjing, Peoples R China
Keywords
Speech emotion recognition; Multi-task learning; Noisy environment; Attention mechanism;
DOI
10.1016/j.bspc.2025.107811
Chinese Library Classification
R318 [Biomedical Engineering]
Discipline Code
0831
Abstract
Speech emotion recognition (SER) has recently made great progress in ideal environments, but its performance deteriorates dramatically in complex real-world environments, mainly due to poor model robustness and generalization. To this end, we propose a multi-task coordinate attention gated network (MTCAGN) framework. For the SER main task, we propose a multi-scale gated convolutional neural network with a coordinate attention mechanism, which captures a wide range of emotional features at different scales along with key global information, accurately focusing on salient emotional features in speech signals. Speech enhancement is used as an auxiliary task during the training phase, and shared representation learning strengthens the overall robustness of the system, allowing it to withstand complex interference in noisy scenarios. In the inference phase, the speech enhancement branch is removed and only the SER task is retained, so the proposed method improves the robustness of the SER system without increasing inference complexity. To simulate noisy scenarios, we construct three noisy speech datasets by randomly mixing clean audio from the IEMOCAP or EMODB datasets with noise from the MUSAN dataset. Empirical results show that the proposed model outperforms present state-of-the-art techniques in challenging low signal-to-noise-ratio environments, as measured by weighted and unweighted accuracy.
Pages: 10