Feature fusion methods research based on deep belief networks for speech emotion recognition under noise condition

Cited by: 59

Authors:

Huang, Yongming [1,2]

Tian, Kexin [1,2]

Wu, Ao [1,2]

Zhang, Guobao [1,2]

Affiliations:

[1] Southeast Univ, Lab Measurement & Control Complex Syst Engn, Nanjing, Jiangsu, Peoples R China

[2] Southeast Univ, Sch Automat, Minist Educ, Nanjing 210096, Jiangsu, Peoples R China

Source:

JOURNAL OF AMBIENT INTELLIGENCE AND HUMANIZED COMPUTING | 2019 / Vol. 10 / No. 05

Keywords:

Speech emotion recognition; Weighted wavelet packets Cepstral coefficients (W-WPCC); Feature fusion; Deep belief networks (DBNs); CHINESE SPEECH; SVM; CLASSIFICATION;

DOI:

10.1007/s12652-017-0644-8

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The accuracy of speech emotion recognition based on prosody and voice quality features declines as the signal-to-noise ratio (SNR) of the speech signal decreases. In this paper, we propose novel sub-band spectral centroid weighted wavelet packet Cepstral coefficients (W-WPCC) for robust speech emotion recognition. The W-WPCC feature is computed by combining the sub-band energies with the sub-band spectral centroids via a weighting scheme to generate noise-robust acoustic features. Deep belief networks (DBNs) are artificial neural networks with more than one hidden layer, which are first pre-trained layer by layer and then fine-tuned using the back-propagation algorithm. Well-trained deep neural networks are capable of modeling complex, non-linear features of the input training data and can better predict the probability distribution over classification labels. We extracted prosody features, voice quality features, and wavelet packet Cepstral coefficients (WPCC) from the speech signals, combined them with W-WPCC, and fused them using DBNs. Experimental results on the Berlin emotional speech database show that the proposed fused feature with W-WPCC is better suited to speech emotion recognition under noisy conditions than other acoustic features, and that the proposed DBN feature-learning structure combined with W-WPCC improves emotion recognition performance over the conventional emotion recognition method.
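
The abstract states only that W-WPCC combines sub-band energies with sub-band spectral centroids through a weighting scheme; it does not give the formula. The following Python sketch is therefore an illustration under stated assumptions, not the authors' method: uniform DFT sub-bands stand in for the wavelet packet sub-bands, the weight is the sub-band centroid divided by the Nyquist frequency, and the function names (`power_spectrum`, `band_energy_and_centroid`, `weighted_subband_cepstrum`) and default parameters are hypothetical.

```python
import cmath
import math

EPS = 1e-12  # guards against division by zero and log(0)


def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2 (adequate for short frames)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def band_energy_and_centroid(power, freqs, lo, hi):
    """Energy and spectral centroid (in Hz) of the bins in [lo, hi)."""
    energy = sum(power[lo:hi])
    moment = sum(f * p for f, p in zip(freqs[lo:hi], power[lo:hi]))
    return energy, moment / (energy + EPS)


def weighted_subband_cepstrum(frame, sample_rate, n_bands=8, n_coeffs=6):
    """Centroid-weighted log sub-band energies followed by a DCT-II.

    ASSUMPTION: uniform DFT bands replace the wavelet packet bands, and the
    weight (centroid / Nyquist) is a hypothetical choice of weighting scheme.
    """
    power = power_spectrum(frame)
    n_bins = len(power)
    nyquist = sample_rate / 2.0
    freqs = [k * nyquist / (n_bins - 1) for k in range(n_bins)]

    log_energies = []
    for b in range(n_bands):
        lo = b * n_bins // n_bands
        hi = (b + 1) * n_bins // n_bands
        energy, centroid = band_energy_and_centroid(power, freqs, lo, hi)
        weight = centroid / nyquist  # hypothetical weighting
        log_energies.append(math.log(energy * weight + EPS))

    # DCT-II decorrelates the log energies, as in other cepstral features.
    return [
        sum(e * math.cos(math.pi * q * (b + 0.5) / n_bands)
            for b, e in enumerate(log_energies))
        for q in range(n_coeffs)
    ]


# Example usage: one 64-sample frame of a 1 kHz tone sampled at 8 kHz.
frame = [math.sin(2 * math.pi * 1000 * t / 8000) for t in range(64)]
coeffs = weighted_subband_cepstrum(frame, sample_rate=8000)
```

A closer approximation would replace `power_spectrum` and the uniform band edges with a wavelet packet decomposition and then fuse the resulting vectors with the prosody and voice quality features before the DBN stage; those steps are outside this sketch.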

Citation:

Pages: 1787-1798

Page count: 12

References: 45 in total

