Speech emotion recognition using deep 1D & 2D CNN LSTM networks

Cited by: 621
Authors
Zhao, Jianfeng [1 ,2 ]
Mao, Xia [1 ]
Chen, Lijiang [1 ]
Affiliations
[1] Beihang Univ, Sch Elect & Informat Engn, Mailbox 206,37 XueYuan Rd, Beijing 100083, Peoples R China
[2] Inner Mongolia Univ Sci & Technol, Sch Informat Engn, Baotou 014010, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Speech emotion recognition; CNN LSTM network; Raw audio clips; Log-mel spectrograms; SPECTRAL FEATURES; ACOUSTIC FEATURE; CLASSIFICATION; SELECTION;
DOI
10.1016/j.bspc.2018.08.035
Chinese Library Classification (CLC)
R318 [Biomedical Engineering];
Discipline classification code
0831;
Abstract
We aim to learn deep emotion features for recognizing speech emotion. Two convolutional neural network and long short-term memory (CNN LSTM) networks, one 1D CNN LSTM network and one 2D CNN LSTM network, were constructed to learn local and global emotion-related features from raw speech and log-mel spectrograms, respectively. The two networks share a similar architecture, each consisting of four local feature learning blocks (LFLBs) and one long short-term memory (LSTM) layer. Each LFLB, which mainly contains one convolutional layer and one max-pooling layer, learns local correlations and extracts hierarchical correlations. The LSTM layer then learns long-term dependencies from these local features. The designed networks, combinations of a convolutional neural network (CNN) and an LSTM, exploit the strengths of both architectures while mitigating their shortcomings, and are evaluated on two benchmark databases. The experimental results show that the designed networks perform well on speech emotion recognition; in particular, the 2D CNN LSTM network outperforms traditional approaches, a Deep Belief Network (DBN), and a CNN on the selected databases. The 2D CNN LSTM network achieves recognition accuracies of 95.33% and 95.89% on Berlin EmoDB in speaker-dependent and speaker-independent experiments, respectively, compared with 91.6% and 92.9% obtained by traditional approaches; it also yields recognition accuracies of 89.16% and 52.14% on the IEMOCAP database in speaker-dependent and speaker-independent experiments, well above the 73.78% and 40.02% obtained by DBN and CNN. (C) 2018 Elsevier Ltd. All rights reserved.
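The abstract outlines the network topology at a high level: four LFLBs (each essentially a convolution followed by max-pooling) feeding one LSTM layer and a classifier. Below is a minimal Keras sketch of the 1D variant under that description; the filter counts, kernel sizes, pool sizes, input length, LSTM width, and number of emotion classes are illustrative assumptions rather than values taken from the paper, which may also use different activations or additional elements such as batch normalization.

# Minimal sketch of the 1D CNN LSTM described in the abstract:
# four local feature learning blocks (Conv1D + max-pooling), then one LSTM
# layer and a softmax classifier. All hyperparameters below are assumptions.
from tensorflow.keras import layers, models

def lflb_1d(x, filters, kernel_size=3, pool_size=4):
    """One local feature learning block: convolution followed by max-pooling."""
    x = layers.Conv1D(filters, kernel_size, padding="same", activation="relu")(x)
    x = layers.MaxPooling1D(pool_size)(x)
    return x

def build_1d_cnn_lstm(input_length=32000, num_classes=7):
    inputs = layers.Input(shape=(input_length, 1))   # raw audio samples
    x = inputs
    for filters in (64, 64, 128, 128):               # four LFLBs
        x = lflb_1d(x, filters)
    x = layers.LSTM(256)(x)                          # long-term dependencies over local features
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)

model = build_1d_cnn_lstm()
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

The 2D variant would follow the same pattern with Conv2D/MaxPooling2D blocks applied to log-mel spectrogram inputs, with the pooled feature maps reshaped into a sequence before the LSTM layer.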
Pages: 312-323
Number of pages: 12