Convolutional Neural Network with Spectrogram and Perceptual Features for Speech Emotion Recognition

Cited by: 5

Authors:

Zhang, Linjuan [1]

Wang, Longbiao [1]

Dang, Jianwu [1,2]

Guo, Lili [1]

Guan, Haotian [3]

Affiliations:

[1] Tianjin Univ, Coll Intelligence & Comp, Tianjin Key Lab Cognit Comp & Applicat, Tianjin, Peoples R China

[2] Japan Adv Inst Sci & Technol, Nomi, Ishikawa, Japan

[3] Intelligent Spoken Language Technol Tianjin Co Lt, Tianjin, Peoples R China

Source:

NEURAL INFORMATION PROCESSING (ICONIP 2018), PT IV | 2018 / Vol. 11304

Funding:

National Natural Science Foundation of China;

Keywords:

Speech emotion recognition; Spectrogram; Perceptual features; Convolutional neural network; Bi-directional long short-term memory;

DOI:

10.1007/978-3-030-04212-7_6

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Convolutional neural networks (CNNs) have proven effective at mining deep information from spectrograms for speech emotion recognition. However, perceptual features such as low-level descriptors (LLDs) and their statistical values have not been sufficiently utilized in CNN-based emotion recognition. To address this, we propose novel features that combine the spectrogram and perceptual features at different levels. First, frame-level LLDs are arranged as time-sequence LLDs. Then, the spectrogram and the time-sequence LLDs are fused into compositional spectrographic features (CSF). To fully utilize perceptual features and global information, statistical values of the LLDs are added to CSF to generate rich-compositional spectrographic features (RSF). Finally, the proposed features are individually fed to a CNN to extract deep features for emotion recognition. A bi-directional long short-term memory network was employed to identify emotions, and the experiments were conducted on EmoDB. Compared with the spectrogram alone, CSF and RSF improve the unweighted accuracy, with relative error reductions of 32.04% and 36.91%, respectively.
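The feature construction described in the abstract can be illustrated with a minimal sketch. The abstract does not state how the spectrogram and LLDs are fused, which statistics are used, or how the statistics are attached, so everything below is an assumption made for illustration: fusion is modeled as stacking rows along the feature axis, the statistics are the mean and population standard deviation, and each statistic is tiled across all frames. The function names (`build_csf`, `build_rsf`, `relative_error_reduction`) are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def time_sequence_llds(frame_llds):
    """Rearrange frame-level LLDs (frames x descriptors) into descriptors x frames."""
    return [list(row) for row in zip(*frame_llds)]


def build_csf(spectrogram, frame_llds):
    """Assumed CSF: spectrogram rows (freq bins x frames) stacked with LLD rows."""
    lld_rows = time_sequence_llds(frame_llds)
    n_frames = len(spectrogram[0])
    if any(len(row) != n_frames for row in spectrogram + lld_rows):
        raise ValueError("spectrogram and LLDs must cover the same frames")
    return [list(row) for row in spectrogram] + lld_rows


def build_rsf(spectrogram, frame_llds):
    """Assumed RSF: CSF plus per-descriptor statistics tiled across all frames."""
    csf = build_csf(spectrogram, frame_llds)
    n_frames = len(csf[0])
    stat_rows = []
    for row in time_sequence_llds(frame_llds):
        # Mean and population std are illustrative choices of "statistical values".
        for stat in (mean(row), pstdev(row)):
            stat_rows.append([stat] * n_frames)
    return csf + stat_rows


def relative_error_reduction(baseline_acc, new_acc):
    """Relative reduction of the error rate (1 - accuracy), in percent."""
    return 100.0 * (new_acc - baseline_acc) / (1.0 - baseline_acc)


if __name__ == "__main__":
    spec = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5], [0.0, 1.0, 0.0, 1.0]]
    llds = [[10.0, 1.0], [20.0, 1.0], [30.0, 1.0], [40.0, 1.0]]
    print(len(build_csf(spec, llds)), len(build_rsf(spec, llds)))
    print(relative_error_reduction(0.5, 0.75))
```

With 3 frequency bins and 2 descriptors over 4 frames, the sketch yields a 5-row CSF and a 9-row RSF, each 4 frames wide. The helper at the end shows how a relative error reduction is computed from two accuracies; the accuracies in the example are toy values, not results from the paper.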

Pages: 62-71

Page count: 10
