A dynamic-static feature fusion learning network for speech emotion recognition

Cited by: 0

Authors:

Xue, Peiyun [1,2]

Gao, Xiang [1]

Bai, Jing [1]

Dong, Zhenan [1]

Wang, Zhiyu [1]

Xu, Jiangshuai [1]

Affiliations:

[1] Taiyuan Univ Technol, Coll Elect Informat Engn, Taiyuan 030024, Peoples R China

[2] Shanxi Acad Adv Res & Innovat, Taiyuan 030032, Peoples R China

Source:

NEUROCOMPUTING | 2025 / Vol. 633

Keywords:

Speech emotion recognition; Multi-feature Learning Network; Dynamic-Static feature fusion; Hybrid feature representation; Attention mechanism; Cross-corpus; RECURRENT;

DOI:

10.1016/j.neucom.2025.129836

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Speech is a paramount mode of human communication, and Speech Emotion Recognition (SER) contributes significantly to the quality and fluency of Human-Computer Interaction (HCI). Feature representation poses a persistent challenge in SER. A single feature can rarely represent speech emotion adequately, while directly concatenating multiple features may overlook their complementary nature and introduce interference from redundant information. To address these difficulties, this paper proposes a Multi-feature Learning network based on Dynamic-Static feature Fusion (ML-DSF) to obtain an effective hybrid feature representation for SER. First, a Time-Frequency domain Self-Calibration Module (TFSC) is proposed to help traditional convolutional neural networks extract static image features from the Log-Mel spectrograms. Then, a Lightweight Temporal Convolutional Network (L-TCNet) is used to acquire multi-scale dynamic temporal causal knowledge from the Mel Frequency Cepstrum Coefficients (MFCC). Finally, both groups of extracted features are fed into a connection attention module, optimized by Principal Component Analysis (PCA), which facilitates emotion classification by reducing redundant information and enhancing the complementary information between features. To ensure the independence of feature extraction, this paper adopts a separate training strategy. Evaluated on two public datasets, the proposed model achieved a Weighted Accuracy (WA) of 93.33% and an Unweighted Accuracy (UA) of 93.12% on the RAVDESS dataset, and 94.95% WA and 94.56% UA on the EmoDB dataset. These results outperformed the State-Of-The-Art (SOTA) findings. The effectiveness of each module is validated by ablation experiments, and a generalization analysis is carried out on cross-corpus SER tasks.
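The abstract reports results as WA and UA without defining them. As a minimal sketch, assuming the convention common in SER work (WA as overall accuracy across all utterances, UA as the mean of per-class recalls), the two metrics can be computed as follows. The function names and the toy labels are illustrative and are not taken from the paper.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by all samples."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally."""
    total = defaultdict(int)  # samples per true class
    hit = defaultdict(int)    # correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        hit[t] += int(t == p)
    return sum(hit[c] / total[c] for c in total) / len(total)


# Hypothetical toy labels: class 0 has four samples, class 1 has two.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0]

wa = weighted_accuracy(y_true, y_pred)    # 5 of 6 samples correct
ua = unweighted_accuracy(y_true, y_pred)  # (4/4 + 1/2) / 2
```

Under this assumed convention, the two numbers diverge when classes are imbalanced, which is why both are usually reported together.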

Pages: 15
