Multi-type features separating fusion learning for Speech Emotion Recognition

Cited: 15
Authors
Xu, Xinlei [1 ,2 ]
Li, Dongdong [2 ]
Zhou, Yijun [2 ]
Wang, Zhe [1 ,2 ]
Affiliations
[1] East China Univ Sci Technol, Key Lab Smart Mfg Energy Chem Proc, Minist Educ, Shanghai 200237, Peoples R China
[2] East China Univ Sci & Technol, Dept Comp Sci & Engn, Shanghai 200237, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Speech emotion recognition; Hybrid feature selection; Feature-level fusion; Speaker-independent; CONVOLUTIONAL NEURAL-NETWORKS; GMM; REPRESENTATIONS; CLASSIFICATION; ADAPTATION; RECURRENT; CNN;
DOI
10.1016/j.asoc.2022.109648
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Speech Emotion Recognition (SER) is a challenging task for improving human-computer interaction. Speech data have multiple representations, and choosing features that best express the emotion behind an utterance is difficult. The human brain judges the same object across different dimensional representations to reach a final decision. Inspired by this, we believe the different representations of speech data offer complementary advantages. Therefore, a Hybrid Deep Learning with Multi-type features Model (HD-MFM) is proposed to integrate the acoustic, temporal, and image information of speech. Specifically, a Convolutional Neural Network (CNN) extracts image information from the spectrogram of speech, a Deep Neural Network (DNN) extracts acoustic information from the statistical features of speech, and a Long Short-Term Memory (LSTM) network extracts temporal information from the Mel-Frequency Cepstral Coefficients (MFCC) of speech. Finally, the three types of speech features are concatenated to obtain a richer, more discriminative emotion representation. Because different fusion strategies affect the relationship between features, we consider two strategies in this paper, named separating and merging. To evaluate the feasibility and effectiveness of the proposed HD-MFM, we perform extensive experiments on the EMO-DB and IEMOCAP SER corpora. The experimental results show that the separating strategy has a clearer advantage in feature complementarity. The proposed HD-MFM achieves 91.25% on EMO-DB and 72.02% on IEMOCAP. These results indicate that the proposed HD-MFM can fully exploit the complementary feature representations through the separating strategy to further enhance SER performance. (c) 2022 Elsevier B.V. All rights reserved.
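The feature-level fusion described in the abstract can be sketched as a simple concatenation of the three branch embeddings. The branch dimensions below (128/64/96) and the function name are illustrative assumptions, not values taken from the paper; each embedding stands in for the output of the CNN (spectrogram), DNN (statistical features), and LSTM (MFCC) branches, respectively.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-utterance branch embeddings (dimensions are assumptions):
cnn_emb = rng.standard_normal(128)   # image branch: CNN over the spectrogram
dnn_emb = rng.standard_normal(64)    # acoustic branch: DNN over statistical features
lstm_emb = rng.standard_normal(96)   # temporal branch: LSTM over MFCC frames

def fuse_by_concatenation(branches):
    """Feature-level fusion: join branch embeddings into one vector,
    preserving each branch's information side by side."""
    return np.concatenate(branches)

fused = fuse_by_concatenation([cnn_emb, dnn_emb, lstm_emb])
print(fused.shape)  # (288,)
```

A downstream classifier then operates on the fused 288-dimensional vector; the "separating" versus "merging" distinction in the paper concerns how the branches are trained and combined before this concatenation step.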
Pages: 13
Cited References
69 records
  • [51] EFFICIENT SPEECH EMOTION RECOGNITION USING MULTI-SCALE CNN AND ATTENTION
    Peng, Zixuan
    Lu, Yu
    Pan, Shengfeng
    Liu, Yunfeng
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 3020 - 3024
  • [52] Speech emotion recognition approaches in human computer interaction
    Ramakrishnan, S.
    El Emary, Ibrahiem M. M.
    [J]. TELECOMMUNICATION SYSTEMS, 2013, 52 (03) : 1467 - 1478
  • [53] Rozgic V, 2012, ASIAPAC SIGN INFO PR
  • [54] Schuller B, 2009, INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, P336
  • [55] Segment-based approach to the recognition of emotions in speech
    Shami, MT
    Kamel, MS
    [J]. 2005 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), VOLS 1 AND 2, 2005, : 366 - 369
  • [56] A Novel Approach for Trajectory Tracking Control of an Under-Actuated Quad-Rotor UAV
    Shao, Ke
    Huang, Kang
    Zhen, Shengchao
    Sun, Hao
    Yu, Rongrong
    [J]. IEEE-CAA JOURNAL OF AUTOMATICA SINICA, 2024, 11 (09) : 2030 - 2032
  • [57] MFCC-based descriptor for bee queen presence detection
    Soares, Bianca Sousa
    Luz, Jederson Sousa
    de Macedo, Valderlandia Francisca
    Veloso e Silva, Romuere Rodrigues
    Duarte de Araujo, Flavio Henrique
    Vieira Magalhaes, Deborah Maria
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2022, 201
  • [58] Weighted spectral features based on local Hu moments for speech emotion recognition
    Sun, Yaxin
    Wen, Guihua
    Wang, Jiabing
    [J]. BIOMEDICAL SIGNAL PROCESSING AND CONTROL, 2015, 18 : 80 - 90
  • [59] Tang DK, 2018, INTERSPEECH, P162
  • [60] Vondra M, 2009, LECT NOTES COMPUT SC, V5641, P98, DOI 10.1007/978-3-642-03320-9_10