Convolutional Neural Network with Spectrogram and Perceptual Features for Speech Emotion Recognition

Cited by: 5
Authors
Zhang, Linjuan [1]
Wang, Longbiao [1]
Dang, Jianwu [1, 2]
Guo, Lili [1]
Guan, Haotian [3]
Affiliations
[1] Tianjin Univ, Coll Intelligence & Comp, Tianjin Key Lab Cognit Comp & Applicat, Tianjin, Peoples R China
[2] Japan Adv Inst Sci & Technol, Nomi, Ishikawa, Japan
[3] Intelligent Spoken Language Technol Tianjin Co Lt, Tianjin, Peoples R China
Source
NEURAL INFORMATION PROCESSING (ICONIP 2018), PT IV | 2018 / Vol. 11304
Funding
National Natural Science Foundation of China
Keywords
Speech emotion recognition; Spectrogram; Perceptual features; Convolutional neural network; Bi-directional long short-term memory
DOI
10.1007/978-3-030-04212-7_6
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Convolutional neural networks (CNNs) have proven effective at mining deep information from spectrograms for speech emotion recognition. However, perceptual features such as low-level descriptors (LLDs) and their statistical values have not been sufficiently utilized in CNN-based emotion recognition. To address this problem, we propose novel features that combine the spectrogram with perceptual features at different levels. First, frame-level LLDs are arranged as time-sequence LLDs. Then, the spectrogram and the time-sequence LLDs are fused into compositional spectrographic features (CSF). To fully utilize perceptual features and global information, statistical values of the LLDs are added to CSF to generate rich-compositional spectrographic features (RSF). Finally, each of the proposed features is fed individually to a CNN to extract deep features for emotion recognition. A bi-directional long short-term memory network is employed to identify emotions, and the experiments are conducted on EmoDB. Compared with the spectrogram alone, CSF and RSF achieve relative error reductions of 32.04% and 36.91%, respectively, in terms of unweighted accuracy.
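The abstract does not specify how the features are fused, so the following is only a minimal sketch under stated assumptions: the spectrogram and the time-sequence LLDs are assumed to share the same number of frames and are stacked along the feature axis to form CSF; for RSF, four per-LLD statistics (mean, standard deviation, minimum, maximum) are assumed and tiled across time before stacking. The function names, array sizes, and choice of statistics are hypothetical and are not taken from the paper. The last helper uses the common definition of relative error reduction, (new accuracy - baseline accuracy) / (1 - baseline accuracy), which is also assumed here.

```python
import numpy as np


def build_csf(spectrogram, llds):
    """Stack a spectrogram (n_freq, T) and time-sequence LLDs (n_lld, T).

    Assumption: fusion is plain concatenation along the feature axis.
    """
    if spectrogram.shape[1] != llds.shape[1]:
        raise ValueError("spectrogram and LLDs must have the same number of frames")
    return np.concatenate([spectrogram, llds], axis=0)


def build_rsf(csf, llds):
    """Append utterance-level LLD statistics to CSF.

    Assumption: the statistics are mean, std, min, and max per LLD, tiled
    across all frames so they can be stacked with the frame-level features.
    """
    stats = np.concatenate([
        llds.mean(axis=1),
        llds.std(axis=1),
        llds.min(axis=1),
        llds.max(axis=1),
    ])
    tiled = np.tile(stats[:, None], (1, csf.shape[1]))
    return np.concatenate([csf, tiled], axis=0)


def relative_error_reduction(acc_base, acc_new):
    """Fraction of the baseline error (1 - acc_base) removed by the new system."""
    return (acc_new - acc_base) / (1.0 - acc_base)


# Hypothetical sizes: 128 frequency bins, 32 LLDs, 300 frames.
rng = np.random.default_rng(0)
spec = rng.random((128, 300))
llds = rng.random((32, 300))

csf = build_csf(spec, llds)
rsf = build_rsf(csf, llds)
print(csf.shape, rsf.shape)  # (160, 300) (288, 300)
```

With these definitions, a hypothetical baseline accuracy of 0.80 and a new accuracy of 0.86 would give `relative_error_reduction(0.80, 0.86)` of about 0.30, meaning 30% of the baseline errors are removed. These numbers are illustrative only and do not reproduce the reported results.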
Pages: 62-71
Number of pages: 10