Multi-type features separating fusion learning for Speech Emotion Recognition

Cited: 15
Authors
Xu, Xinlei [1 ,2 ]
Li, Dongdong [2 ]
Zhou, Yijun [2 ]
Wang, Zhe [1 ,2 ]
Affiliations
[1] East China Univ Sci Technol, Key Lab Smart Mfg Energy Chem Proc, Minist Educ, Shanghai 200237, Peoples R China
[2] East China Univ Sci & Technol, Dept Comp Sci & Engn, Shanghai 200237, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Speech emotion recognition; Hybrid feature selection; Feature-level fusion; Speaker-independent; CONVOLUTIONAL NEURAL-NETWORKS; GMM; REPRESENTATIONS; CLASSIFICATION; ADAPTATION; RECURRENT; CNN;
DOI
10.1016/j.asoc.2022.109648
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Speech Emotion Recognition (SER) is a challenging task for improving human-computer interaction. Speech data have multiple representations, and choosing features that best express the emotion behind an utterance is difficult. The human brain judges the same object across different dimensional representations to reach a final decision. Inspired by this, we believe the different representations of speech data offer complementary advantages. Therefore, a Hybrid Deep Learning with Multi-type features Model (HD-MFM) is proposed to integrate the acoustic, temporal, and image information of speech. Specifically, a Convolutional Neural Network (CNN) extracts image information from the spectrogram of speech, a Deep Neural Network (DNN) extracts acoustic information from the statistical features of speech, and a Long Short-Term Memory (LSTM) network extracts temporal information from the Mel-Frequency Cepstral Coefficients (MFCC) of speech. Finally, the three types of speech features are concatenated to obtain a richer, more discriminative emotion representation. Because different fusion strategies affect the relationship between features, we consider two strategies in this paper, named separating and merging. To evaluate the feasibility and effectiveness of the proposed HD-MFM, we perform extensive experiments on the EMO-DB and IEMOCAP SER corpora. The experimental results show that the separating strategy has a clearer advantage in feature complementarity. The proposed HD-MFM achieves 91.25% on EMO-DB and 72.02% on IEMOCAP. These results indicate that the proposed HD-MFM can fully exploit the complementary feature representations through the separating strategy to further enhance SER performance. (c) 2022 Elsevier B.V. All rights reserved.
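The feature-level fusion described in the abstract can be sketched as a simple concatenation of the three branch embeddings. The branch dimensions below (128/64/96) and the function name are illustrative assumptions, not values taken from the paper; each embedding stands in for the output of the CNN (spectrogram), DNN (statistical features), and LSTM (MFCC) branches, respectively.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-utterance branch embeddings (dimensions are assumptions):
cnn_emb = rng.standard_normal(128)   # image branch: CNN over the spectrogram
dnn_emb = rng.standard_normal(64)    # acoustic branch: DNN over statistical features
lstm_emb = rng.standard_normal(96)   # temporal branch: LSTM over MFCC frames

def fuse_by_concatenation(branches):
    """Feature-level fusion: join branch embeddings into one vector,
    preserving each branch's information side by side."""
    return np.concatenate(branches)

fused = fuse_by_concatenation([cnn_emb, dnn_emb, lstm_emb])
print(fused.shape)  # (288,)
```

A downstream classifier then operates on the fused 288-dimensional vector; the "separating" versus "merging" distinction in the paper concerns how the branches are trained and combined before this concatenation step.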
Pages: 13
Cited References
69 records
  • [51] EFFICIENT SPEECH EMOTION RECOGNITION USING MULTI-SCALE CNN AND ATTENTION
    Peng, Zixuan
    Lu, Yu
    Pan, Shengfeng
    Liu, Yunfeng
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 3020 - 3024
  • [52] Speech emotion recognition approaches in human computer interaction
    Ramakrishnan, S.
    El Emary, Ibrahiem M. M.
    [J]. TELECOMMUNICATION SYSTEMS, 2013, 52 (03) : 1467 - 1478
  • [53] Rozgic V, 2012, ASIAPAC SIGN INFO PR
  • [54] Schuller B, 2009, INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, P336
  • [55] Segment-based approach to the recognition of emotions in speech
    Shami, MT
    Kamel, MS
    [J]. 2005 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), VOLS 1 AND 2, 2005, : 366 - 369
  • [56] A Novel Approach for Trajectory Tracking Control of an Under-Actuated Quad-Rotor UAV
    Shao, Ke
    Huang, Kang
    Zhen, Shengchao
    Sun, Hao
    Yu, Rongrong
    [J]. IEEE-CAA JOURNAL OF AUTOMATICA SINICA, 2024, 11 (09) : 2030 - 2032
  • [57] MFCC-based descriptor for bee queen presence detection
    Soares, Bianca Sousa
    Luz, Jederson Sousa
    de Macedo, Valderlandia Francisca
    Veloso e Silva, Romuere Rodrigues
    Duarte de Araujo, Flavio Henrique
    Vieira Magalhaes, Deborah Maria
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2022, 201
  • [58] Weighted spectral features based on local Hu moments for speech emotion recognition
    Sun, Yaxin
    Wen, Guihua
    Wang, Jiabing
    [J]. BIOMEDICAL SIGNAL PROCESSING AND CONTROL, 2015, 18 : 80 - 90
  • [59] Tang DK, 2018, INTERSPEECH, P162
  • [60] Vondra M, 2009, LECT NOTES COMPUT SC, V5641, P98, DOI 10.1007/978-3-642-03320-9_10