BLSTM and CNN Stacking Architecture for Speech Emotion Recognition

Cited by: 0
Authors
Dongdong Li
Linyu Sun
Xinlei Xu
Zhe Wang
Jing Zhang
Wenli Du
Affiliations
[1] Key Laboratory of Advanced Control and Optimization for Chemical Processes, Ministry of Education
[2] Department of Computer Science and Engineering, East China University of Science and Technology
[3] Provincial Key Laboratory for Computer Information Processing Technology, East China University of Science and Technology
[4] Soochow University
Source
Neural Processing Letters | 2021, Vol. 53
Keywords
Speech emotion recognition; Convolutional neural network; Bidirectional long short term memory; Stacking;
DOI: not available
Abstract
Speech Emotion Recognition (SER) poses the challenging task of distinguishing and interpreting the sentiments carried in speech. Fortunately, deep learning has proven highly capable of handling acoustic features. For instance, Bidirectional Long Short Term Memory (BLSTM) is well suited to modeling time-series acoustic features, while a Convolutional Neural Network (CNN) can discover the local structure among different features. This paper proposes the BLSTM and CNN Stacking Architecture (BCSA) to enhance the ability to recognize emotions. To match the input formats of the BLSTM and the CNN, the feature matrices are sliced. To exploit the complementary roles of the BLSTM and the CNN, Stacking is employed to integrate the two models. In detail, to mitigate overfitting, the probability estimates from the BLSTM and the CNN are combined into new data using K-fold cross-validation. Finally, on top of the stacked models, logistic regression fits this new data to recognize emotions effectively. The experimental results demonstrate that the proposed architecture outperforms either single model. Furthermore, compared with state-of-the-art SER models known to us, the proposed BCSA may be better suited to SER because it integrates time-series acoustic features with the local structure among different features.
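
To make the stacking step concrete, the sketch below illustrates the general K-fold stacking recipe the abstract describes: out-of-fold probability estimates from two base learners are concatenated into "new data", and a logistic-regression meta-learner is fit on that data. This is a minimal illustration, not the authors' implementation: the scikit-learn MLP classifiers, the helper out_of_fold_probas, and the toy feature shapes are all placeholders standing in for the paper's BLSTM and CNN base models and their sliced acoustic-feature matrices.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

def out_of_fold_probas(make_model, X, y, n_classes, n_splits=5, seed=0):
    # Collect out-of-fold class-probability estimates for one base learner,
    # so the meta-learner never sees predictions made on a model's own training folds.
    probas = np.zeros((len(X), n_classes))
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in skf.split(X, y):
        model = make_model()
        model.fit(X[train_idx], y[train_idx])
        probas[val_idx] = model.predict_proba(X[val_idx])
    return probas

# Toy data standing in for sliced acoustic-feature matrices (shapes are invented).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))    # 200 utterances, 40-dimensional features
y = rng.integers(0, 4, size=200)  # 4 emotion classes

# Stand-ins for the paper's BLSTM and CNN base learners.
make_blstm_like = lambda: MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=1)
make_cnn_like = lambda: MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=2)

# "New data": concatenated out-of-fold probabilities from both base learners.
meta_features = np.hstack([
    out_of_fold_probas(make_blstm_like, X, y, n_classes=4),
    out_of_fold_probas(make_cnn_like, X, y, n_classes=4),
])

# Logistic-regression meta-learner fits the new data to predict emotions.
meta_learner = LogisticRegression(max_iter=1000).fit(meta_features, y)
print("stacked training accuracy:", meta_learner.score(meta_features, y))

In the paper the base learners are the BLSTM and the CNN trained on the sliced feature matrices; the same recipe applies by swapping the stand-in classifiers for those networks.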
Pages: 4097-4115 (18 pages)