A dynamic-static feature fusion learning network for speech emotion recognition

Cited by: 0
Authors
Xue, Peiyun [1 ,2 ]
Gao, Xiang [1 ]
Bai, Jing [1 ]
Dong, Zhenan [1 ]
Wang, Zhiyu [1 ]
Xu, Jiangshuai [1 ]
Affiliations
[1] Taiyuan Univ Technol, Coll Elect Informat Engn, Taiyuan 030024, Peoples R China
[2] Shanxi Acad Adv Res & Innovat, Taiyuan 030032, Peoples R China
Keywords
Speech emotion recognition; Multi-feature Learning Network; Dynamic-Static feature fusion; Hybrid feature representation; Attention mechanism; Cross-corpus
DOI
10.1016/j.neucom.2025.129836
Chinese Library Classification
TP18 [Theory of artificial intelligence];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Speech is a paramount mode of human communication, and Speech Emotion Recognition (SER) contributes significantly to the quality and fluency of Human-Computer Interaction (HCI). Feature representation poses a persistent challenge in SER: a single feature can hardly represent speech emotion adequately, while directly concatenating multiple features may overlook their complementary nature and introduce interference from redundant information. To address these difficulties, this paper proposes a Multi-feature Learning network based on Dynamic-Static feature Fusion (ML-DSF) to obtain an effective hybrid feature representation for SER. First, a Time-Frequency domain Self-Calibration Module (TFSC) is proposed to help traditional convolutional neural networks extract static image features from Log-Mel spectrograms. Then, a Lightweight Temporal Convolutional Network (L-TCNet) is used to acquire multi-scale dynamic temporal causal knowledge from Mel-Frequency Cepstral Coefficients (MFCC). Finally, both extracted feature groups are fed into a connection attention module optimized by Principal Component Analysis (PCA), which facilitates emotion classification by reducing redundant information and enhancing the complementarity between features. To ensure the independence of feature extraction, a training-separation strategy is adopted. Evaluated on two public datasets, the proposed model achieves a Weighted Accuracy (WA) of 93.33 % and an Unweighted Accuracy (UA) of 93.12 % on the RAVDESS dataset, and 94.95 % WA and 94.56 % UA on the EmoDB dataset, outperforming State-Of-The-Art (SOTA) results. The effectiveness of each module is validated through ablation experiments, and generalization is analyzed on cross-corpus SER tasks.
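To make the described pipeline concrete, below is a minimal PyTorch sketch of the dynamic-static fusion idea: a CNN branch for static Log-Mel features, a dilated-convolution branch for dynamic MFCC features, and a gated fusion head. All module names, layer sizes, and the gating scheme are illustrative assumptions; they are not the paper's actual TFSC, L-TCNet, or PCA-optimized connection attention implementations.

```python
import torch
import torch.nn as nn

# Minimal sketch of dynamic-static feature fusion for SER.
# StaticBranch / DynamicBranch / FusionSER are hypothetical stand-ins,
# NOT the paper's TFSC, L-TCNet, or PCA-optimized attention modules.

class StaticBranch(nn.Module):
    """CNN over a Log-Mel spectrogram (B, 1, n_mels, T) -> static feature vector."""
    def __init__(self, out_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),       # global pooling -> (B, 64, 1, 1)
        )
        self.fc = nn.Linear(64, out_dim)

    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))

class DynamicBranch(nn.Module):
    """Dilated 1-D convolutions over MFCC frames (B, n_mfcc, T) -> dynamic
    feature vector; growing dilation approximates multi-scale temporal context."""
    def __init__(self, n_mfcc=40, out_dim=128):
        super().__init__()
        layers, ch = [], n_mfcc
        for d in (1, 2, 4):                # multi-scale receptive fields
            layers += [nn.Conv1d(ch, 64, 3, dilation=d, padding=d), nn.ReLU()]
            ch = 64
        self.tcn = nn.Sequential(*layers)
        self.fc = nn.Linear(64, out_dim)

    def forward(self, x):
        return self.fc(self.tcn(x).mean(dim=-1))  # average over time

class FusionSER(nn.Module):
    """Gated (attention-style) fusion of both branches, then classification."""
    def __init__(self, dim=128, n_classes=8):  # 8 emotion classes as in RAVDESS
        super().__init__()
        self.static = StaticBranch(dim)
        self.dynamic = DynamicBranch(out_dim=dim)
        self.gate = nn.Sequential(nn.Linear(2 * dim, 2 * dim), nn.Sigmoid())
        self.head = nn.Linear(2 * dim, n_classes)

    def forward(self, logmel, mfcc):
        h = torch.cat([self.static(logmel), self.dynamic(mfcc)], dim=1)
        return self.head(h * self.gate(h))  # reweight, suppress redundancy

model = FusionSER()
logits = model(torch.randn(4, 1, 64, 200), torch.randn(4, 40, 200))
print(logits.shape)  # torch.Size([4, 8])
```

The gated reweighting stands in for the paper's connection attention: it learns per-dimension weights over the concatenated features, which is one simple way to emphasize complementary components and damp redundant ones before classification.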
Pages: 15