A dynamic-static feature fusion learning network for speech emotion recognition

Cited: 0
Authors
Xue, Peiyun [1 ,2 ]
Gao, Xiang [1 ]
Bai, Jing [1 ]
Dong, Zhenan [1 ]
Wang, Zhiyu [1 ]
Xu, Jiangshuai [1 ]
Affiliations
[1] Taiyuan Univ Technol, Coll Elect Informat Engn, Taiyuan 030024, Peoples R China
[2] Shanxi Acad Adv Res & Innovat, Taiyuan 030032, Peoples R China
Keywords
Speech emotion recognition; Multi-feature Learning Network; Dynamic-Static feature fusion; Hybrid feature representation; Attention mechanism; Cross-corpus; RECURRENT;
DOI
10.1016/j.neucom.2025.129836
CLC Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Speech is a paramount mode of human communication, and Speech Emotion Recognition (SER) contributes significantly to improving the quality and fluency of Human-Computer Interaction (HCI). Feature representation poses a persistent challenge in SER: a single feature can hardly represent speech emotion adequately, while directly concatenating multiple features may overlook their complementary nature and introduce interference from redundant information. To address these difficulties, this paper proposes a Multi-feature Learning network based on Dynamic-Static feature Fusion (ML-DSF) to obtain an effective hybrid feature representation for SER. First, a Time-Frequency domain Self-Calibration Module (TFSC) is proposed to help traditional convolutional neural networks extract static image features from Log-Mel spectrograms. Then, a Lightweight Temporal Convolutional Network (L-TCNet) is used to acquire multi-scale dynamic temporal causal knowledge from Mel-Frequency Cepstral Coefficients (MFCC). Finally, both extracted feature groups are fed into a connection attention module, optimized by Principal Component Analysis (PCA), which facilitates emotion classification by reducing redundant information and enhancing the complementary information between features. To ensure the independence of feature extraction, this paper adopts a training separation strategy. Evaluating the proposed model on two public datasets yielded a Weighted Accuracy (WA) of 93.33% and an Unweighted Accuracy (UA) of 93.12% on the RAVDESS dataset, and 94.95% WA and 94.56% UA on the EmoDB dataset, outperforming State-Of-The-Art (SOTA) results. The effectiveness of each module is validated by ablation experiments, and a generalization analysis is carried out on cross-corpus SER tasks.
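The pipeline described in the abstract (a static CNN branch over Log-Mel spectrograms, a dynamic temporal-convolution branch over MFCCs, and attention-based fusion of the two feature groups) can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the class names (StaticBranch, DynamicBranch, FusionClassifier), layer sizes, and the simple two-way attention gate are illustrative assumptions, and the TFSC self-calibration and PCA optimization steps are omitted for brevity.

```python
# Minimal sketch of a dynamic-static feature fusion network (illustrative only).
import torch
import torch.nn as nn


class StaticBranch(nn.Module):
    """Stand-in for the TFSC-assisted CNN over Log-Mel spectrograms (static features)."""
    def __init__(self, out_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)),
        )
        self.fc = nn.Linear(32 * 8 * 8, out_dim)

    def forward(self, logmel: torch.Tensor) -> torch.Tensor:
        # logmel: (batch, 1, mel_bins, frames)
        return self.fc(self.conv(logmel).flatten(1))


class DynamicBranch(nn.Module):
    """Stand-in for L-TCNet: dilated 1-D convolutions over MFCC frame sequences."""
    def __init__(self, n_mfcc: int = 40, out_dim: int = 128):
        super().__init__()
        self.tcn = nn.Sequential(
            nn.Conv1d(n_mfcc, 64, kernel_size=3, padding=1, dilation=1),
            nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=3, padding=2, dilation=2),
            nn.ReLU(),
        )
        self.fc = nn.Linear(64, out_dim)

    def forward(self, mfcc: torch.Tensor) -> torch.Tensor:
        # mfcc: (batch, n_mfcc, frames); average over time after convolution
        return self.fc(self.tcn(mfcc).mean(dim=-1))


class FusionClassifier(nn.Module):
    """Attention-gated concatenation of the two feature groups (PCA step omitted)."""
    def __init__(self, dim: int = 128, n_classes: int = 8):
        super().__init__()
        self.static_branch = StaticBranch(dim)
        self.dynamic_branch = DynamicBranch(out_dim=dim)
        self.gate = nn.Sequential(nn.Linear(2 * dim, 2), nn.Softmax(dim=-1))
        self.head = nn.Linear(2 * dim, n_classes)

    def forward(self, logmel: torch.Tensor, mfcc: torch.Tensor) -> torch.Tensor:
        s = self.static_branch(logmel)
        d = self.dynamic_branch(mfcc)
        w = self.gate(torch.cat([s, d], dim=-1))  # per-branch attention weights
        fused = torch.cat([w[:, :1] * s, w[:, 1:] * d], dim=-1)
        return self.head(fused)


model = FusionClassifier()
logits = model(torch.randn(4, 1, 64, 128), torch.randn(4, 40, 128))
print(logits.shape)  # torch.Size([4, 8])
```

The two branches are trained and run independently up to the fusion point, loosely mirroring the training separation strategy the abstract describes for keeping the static and dynamic feature extractors independent.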
Pages: 15
Related Papers
50 records in total
  • [1] Dynamic-Static Cross Attentional Feature Fusion Method for Speech Emotion Recognition
    Dong, Ke
    Peng, Hao
    Che, Jie
    MULTIMEDIA MODELING, MMM 2023, PT II, 2023, 13834 : 350 - 361
  • [2] DSTM: A transformer-based model with dynamic-static feature fusion in speech emotion recognition
    Jin, Guowei
    Xu, Yunfeng
    Kang, Hong
    Wang, Jialin
    Miao, Borui
    COMPUTER SPEECH AND LANGUAGE, 2025, 90
  • [3] HIERARCHICAL NETWORK BASED ON THE FUSION OF STATIC AND DYNAMIC FEATURES FOR SPEECH EMOTION RECOGNITION
    Cao, Qi
    Hou, Mixiao
    Chen, Bingzhi
    Zhang, Zheng
    Lu, Guangming
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 6334 - 6338
  • [4] Speech Emotion Recognition Based on Feature Fusion
    Shen, Qi
    Chen, Guanggen
    Chang, Lin
    PROCEEDINGS OF THE 2017 2ND INTERNATIONAL CONFERENCE ON MATERIALS SCIENCE, MACHINERY AND ENERGY ENGINEERING (MSMEE 2017), 2017, 123 : 1071 - 1074
  • [5] Speech emotion recognition using feature fusion: a hybrid approach to deep learning
    Khan, Waleed Akram
    ul Qudous, Hamad
    Farhan, Asma Ahmad
    MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (31) : 75557 - 75584
  • [6] A FEATURE FUSION METHOD BASED ON EXTREME LEARNING MACHINE FOR SPEECH EMOTION RECOGNITION
    Guo, Lili
    Wang, Longbiao
    Dang, Jianwu
    Zhang, Linjuan
    Guan, Haotian
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 2666 - 2670
  • [7] Speech Emotion Recognition based on Multiple Feature Fusion
    Jiang, Changjiang
    Mao, Rong
    Liu, Geng
    Wang, Mingyi
    2019 CHINESE AUTOMATION CONGRESS (CAC2019), 2019, : 907 - 912
  • [8] Speech emotion recognition with unsupervised feature learning
    Huang, Zheng-wei
    Xue, Wen-tao
    Mao, Qi-rong
    FRONTIERS OF INFORMATION TECHNOLOGY & ELECTRONIC ENGINEERING, 2015, 16 (05) : 358 - 366
  • [9] Discriminative Feature Learning for Speech Emotion Recognition
    Zhang, Yuying
    Zou, Yuexian
    Peng, Junyi
    Luo, Danqing
    Huang, Dongyan
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2019: TEXT AND TIME SERIES, PT IV, 2019, 11730 : 198 - 210