BLSTM and CNN Stacking Architecture for Speech Emotion Recognition

Cited by: 0
Authors
Dongdong Li
Linyu Sun
Xinlei Xu
Zhe Wang
Jing Zhang
Wenli Du
Affiliations
[1] Key Laboratory of Advanced Control and Optimization for Chemical Processes, Ministry of Education
[2] Department of Computer Science and Engineering, East China University of Science and Technology
[3] Provincial Key Laboratory for Computer Information Processing Technology, East China University of Science and Technology
[4] Soochow University
Source
Neural Processing Letters | 2021, Vol. 53
Keywords
Speech emotion recognition; Convolutional neural network; Bidirectional long short term memory; Stacking;
DOI: not available
Abstract
Speech Emotion Recognition (SER), the task of distinguishing and interpreting the emotions carried in speech, remains a substantial challenge. Deep learning has proven effective at handling acoustic features: Bidirectional Long Short Term Memory (BLSTM) networks excel at modeling time-series acoustic features, while Convolutional Neural Networks (CNNs) can discover the local structure among different features. This paper proposes the BLSTM and CNN Stacking Architecture (BCSA) to improve emotion recognition. To match the input formats expected by the BLSTM and the CNN, the feature matrices are sliced accordingly. To exploit the complementary roles of the two models, Stacking is employed to integrate them: to mitigate overfitting, the probability estimates produced by the BLSTM and the CNN under K-fold cross validation are combined into a new dataset. Finally, a logistic regression meta-learner is fitted to this new data to recognize emotions. Experimental results demonstrate that the proposed architecture outperforms either single model. Furthermore, compared with state-of-the-art SER models known to us, the proposed BCSA may be better suited to SER because it integrates time-series acoustic features with the local structure among different features.
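The stacking scheme summarized in the abstract — out-of-fold probability estimates from each base model, concatenated and fed to a logistic-regression meta-learner — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the BLSTM and CNN base learners are replaced with lightweight scikit-learn stand-ins (`MLPClassifier`, `RandomForestClassifier`), and the synthetic data and all parameter choices are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

def out_of_fold_probs(model, X, y, n_splits=5):
    """K-fold out-of-fold class-probability estimates.

    Each sample's probabilities come from a model that never saw it
    during training, which is the step the paper uses to keep the
    meta-learner from overfitting to the base models.
    """
    oof = np.zeros((len(X), len(np.unique(y))))
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, val_idx in kf.split(X):
        model.fit(X[train_idx], y[train_idx])
        oof[val_idx] = model.predict_proba(X[val_idx])
    return oof

# Synthetic 4-class data standing in for sliced acoustic feature matrices.
X, y = make_classification(n_samples=400, n_features=20, n_classes=4,
                           n_informative=10, random_state=0)

# Stand-ins for the BLSTM and CNN base learners (hypothetical choices).
base_a = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
base_b = RandomForestClassifier(n_estimators=50, random_state=0)

# Level-1 features: concatenated out-of-fold probabilities from both models.
meta_X = np.hstack([out_of_fold_probs(base_a, X, y),
                    out_of_fold_probs(base_b, X, y)])

# Logistic-regression meta-learner fitted on the stacked probabilities.
meta = LogisticRegression(max_iter=1000).fit(meta_X, y)
acc = meta.score(meta_X, y)
```

In the paper's setting, `base_a` and `base_b` would be the BLSTM and CNN, and `X` the sliced acoustic feature matrices; the stacking logic itself is unchanged.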
Pages: 4097–4115
Page count: 18