Multi-type features separating fusion learning for Speech Emotion Recognition

Cited by: 15
Authors
Xu, Xinlei [1,2]
Li, Dongdong [2]
Zhou, Yijun [2]
Wang, Zhe [1,2]
Affiliations
[1] East China University of Science and Technology, Key Laboratory of Smart Manufacturing in Energy Chemical Process, Ministry of Education, Shanghai 200237, China
[2] East China University of Science and Technology, Department of Computer Science and Engineering, Shanghai 200237, China
Funding
National Natural Science Foundation of China
Keywords
Speech emotion recognition; Hybrid feature selection; Feature-level fusion; Speaker-independent; Convolutional neural networks; GMM; Representations; Classification; Adaptation; Recurrent; CNN
DOI
10.1016/j.asoc.2022.109648
CLC number
TP18 [Theory of Artificial Intelligence]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Speech Emotion Recognition (SER) is a challenging task that is central to improving human-computer interaction. Speech data admit multiple representations, and choosing features that adequately express the emotion behind an utterance is difficult. The human brain judges the same object across different dimensional representations and integrates them into a final decision; inspired by this, we argue that different representations of speech data can complement one another. We therefore propose a Hybrid Deep Learning with Multi-type features Model (HD-MFM) that integrates the acoustic, temporal, and image information of speech. Specifically, a Convolutional Neural Network (CNN) extracts image information from the speech spectrogram, a Deep Neural Network (DNN) extracts acoustic information from utterance-level statistical features, and a Long Short-Term Memory (LSTM) network extracts temporal information from the Mel-Frequency Cepstral Coefficients (MFCC) of speech. The three feature types are then concatenated into a richer emotion representation with better discriminative properties. Because the fusion strategy affects the relationship between features, we study two strategies, termed separating and merging. To evaluate the feasibility and effectiveness of the proposed HD-MFM, we perform extensive SER experiments on EMO-DB and IEMOCAP. The experimental results show that the separating strategy exploits feature complementarity more effectively: HD-MFM achieves 91.25% on EMO-DB and 72.02% on IEMOCAP. These results indicate that HD-MFM, with the separating strategy, makes full use of complementary feature representations to further enhance SER performance. (c) 2022 Elsevier B.V. All rights reserved.
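The abstract describes a three-branch architecture: a CNN over spectrograms, a DNN over utterance-level statistical features, and an LSTM over MFCC sequences, fused by concatenating the three embeddings at the feature level. The following is a minimal PyTorch sketch of that idea only; all layer sizes, feature dimensions (88 statistical features, 40 MFCCs, 4 emotion classes), and the classifier head are illustrative assumptions rather than the authors' configuration, and the separating-versus-merging training strategies compared in the paper are not reproduced here.

import torch
import torch.nn as nn

class HDMFMSketch(nn.Module):
    """Three-branch CNN/DNN/LSTM model with feature-level concatenation."""
    def __init__(self, n_stat=88, n_mfcc=40, n_classes=4):
        super().__init__()
        # CNN branch: image information from the spectrogram.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)), nn.Flatten(),
            nn.Linear(32 * 4 * 4, 128), nn.ReLU(),
        )
        # DNN branch: acoustic information from utterance-level statistics.
        self.dnn = nn.Sequential(
            nn.Linear(n_stat, 256), nn.ReLU(), nn.Linear(256, 128), nn.ReLU(),
        )
        # LSTM branch: temporal information from the MFCC frame sequence.
        self.lstm = nn.LSTM(n_mfcc, 128, batch_first=True)
        # Feature-level fusion: concatenate the three 128-d embeddings.
        self.head = nn.Linear(3 * 128, n_classes)

    def forward(self, spec, stats, mfcc):
        z_img = self.cnn(spec)             # (B, 128)
        z_acou = self.dnn(stats)           # (B, 128)
        _, (h_n, _) = self.lstm(mfcc)      # final hidden state: (1, B, 128)
        z_temp = h_n.squeeze(0)            # (B, 128)
        return self.head(torch.cat([z_img, z_acou, z_temp], dim=1))

# Shape check on a batch of 8 utterances (random stand-in data).
model = HDMFMSketch()
logits = model(torch.randn(8, 1, 64, 64),   # spectrogram images
               torch.randn(8, 88),          # statistical features
               torch.randn(8, 100, 40))     # 100 MFCC frames per utterance
print(logits.shape)                         # torch.Size([8, 4])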
Pages: 13
Related papers
69 records in total (entries [11]-[20] shown)
  • [11] Burkhardt, F., 2005. Proceedings of the 9th European Conference on Speech Communication and Technology (Interspeech 2005), Vol. 5, p. 1517. DOI: 10.21437/INTERSPEECH.2005-446.
  • [12] Busso, Carlos; Bulut, Murtaza; Lee, Chi-Chun; Kazemzadeh, Abe; Mower, Emily; Kim, Samuel; Chang, Jeannette N.; Lee, Sungbok; Narayanan, Shrikanth S. IEMOCAP: interactive emotional dyadic motion capture database. Language Resources and Evaluation, 2008, 42(4): 335–359.
  • [13] Chen, Lijiang; Mao, Xia; Yan, Hong. Text-Independent Phoneme Segmentation Combining EGG and Speech Data. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016, 24(6): 1029–1037.
  • [14] Daneshfar, Fatemeh; Kabudian, Seyed Jahanshah. Speech emotion recognition using discriminative dimension reduction by employing a modified quantum-behaved particle swarm optimization algorithm. Multimedia Tools and Applications, 2020, 79(1-2): 1261–1289.
  • [15] Demircan, S., 2014. Journal of Advances in Computer Networks, 2: 34. DOI: 10.7763/JACN.2014.V2.76.
  • [16] Eyben, F., 2013. Proceedings of the 21st ACM International Conference on Multimedia, p. 835. DOI: 10.1145/2502081.2502224.
  • [17] Eyben, Florian; Scherer, Klaus R.; Schuller, Bjoern W.; Sundberg, Johan; Andre, Elisabeth; Busso, Carlos; Devillers, Laurence Y.; Epps, Julien; Laukka, Petri; Narayanan, Shrikanth S.; Truong, Khiet P. The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing. IEEE Transactions on Affective Computing, 2016, 7(2): 190–202.
  • [18] Fahad, Md. Shah; Ranjan, Ashish; Yadav, Jainath; Deepak, Akshay. A survey of speech emotion recognition in natural environment. Digital Signal Processing, 2021, 110.
  • [19] Fan, Weiquan; Xu, Xiangmin; Xing, Xiaofen; Huang, Dongyan. Adaptive Domain-Aware Representation Learning for Speech Emotion Recognition. Interspeech 2020, 2020: 4089–4093.
  • [20] Fayek, Haytham M.; Lech, Margaret; Cavedon, Lawrence. Evaluating deep learning architectures for Speech Emotion Recognition. Neural Networks, 2017, 92: 60–68.