A Robust Acoustic Feature Extraction Approach Based On Stacked Denoising Autoencoder

Cited by: 10
Authors
Liu, J. H. [1 ]
Zheng, W. Q. [1 ]
Zou, Y. X. [1 ]
Affiliation
[1] Peking Univ, ADSPLAB ELIP, Sch Elect & Comp Engn, Shenzhen, Peoples R China
Source
2015 1ST IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA BIG DATA (BIGMM) | 2015
Keywords
robust acoustic feature extraction; stacked denoising autoencoder; noisy environment; speaker classification;
DOI
10.1109/BigMM.2015.46
Chinese Library Classification (CLC)
TP [Automation Technology; Computer Technology];
Discipline Classification Code
0812;
Abstract
Acoustic feature extraction (AFE) is considered one of the most challenging techniques for speech applications, since adverse environmental noise causes significant variation in the extracted acoustic features. In this paper, we propose a systematic AFE approach based on a stacked denoising autoencoder (SDAE), aiming to extract acoustic features automatically. The denoising autoencoder (DAE), which is trained to reconstruct a clean, "repaired" input from a corrupted version of it, serves as the basic building block of the SDAE. In addition, training on a set of both clean and noisy speech gives the SDAE a much more powerful ability to extract robust features under different noise conditions. Taking speaker classification with the extracted features as the evaluation task, intensive experiments on TIMIT and NIST SRE 2004 show that an SDAE with 3 hidden layers (3L-SDAE) outperforms shallower architectures. The results also show that the features extracted by the 3L-SDAE perform better than MFCC features when the SNR is lower than 6 dB and degrade more gracefully as the SNR decreases. Moreover, for different types of noise at an SNR of 0 dB, the speaker classification accuracy with 3L-SDAE features stays above about 84%, while with MFCC features it falls below 77%.
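The DAE building block described in the abstract can be sketched as follows. This is a minimal, hypothetical NumPy toy, not the authors' implementation: one sigmoid hidden layer with tied weights, trained to reconstruct the clean input from a masking-noise-corrupted copy. Stacking several such layers, each trained on the previous layer's features, yields an SDAE.

```python
# Hypothetical toy DAE (not the paper's code): sigmoid units, tied
# weights, trained to map a corrupted input back to the CLEAN input.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class DenoisingAutoencoder:
    def __init__(self, n_in, n_hidden, lr=0.5):
        self.W = rng.normal(0.0, 0.1, (n_in, n_hidden))  # shared encoder/decoder weights
        self.b = np.zeros(n_hidden)  # encoder bias
        self.c = np.zeros(n_in)      # decoder bias
        self.lr = lr

    def encode(self, x):
        return sigmoid(x @ self.W + self.b)

    def decode(self, h):
        return sigmoid(h @ self.W.T + self.c)

    def train_step(self, clean, noisy):
        h = self.encode(noisy)   # features from the CORRUPTED input
        r = self.decode(h)       # reconstruction, compared to the CLEAN input
        d_r = r - clean          # output gradient (cross-entropy + sigmoid)
        d_h = (d_r @ self.W) * h * (1.0 - h)  # backprop through hidden sigmoid
        n = len(clean)
        # Tied weights: W gets both the encoder and decoder gradient terms.
        self.W -= self.lr * (noisy.T @ d_h + d_r.T @ h) / n
        self.b -= self.lr * d_h.mean(axis=0)
        self.c -= self.lr * d_r.mean(axis=0)
        return float(((r - clean) ** 2).mean())  # monitor reconstruction MSE

# Toy stand-in for acoustic frames: 4 binary prototype patterns.
prototypes = rng.integers(0, 2, (4, 16)).astype(float)
clean = prototypes[rng.integers(0, 4, size=256)]

dae = DenoisingAutoencoder(n_in=16, n_hidden=8)
first_err = last_err = None
for step in range(2000):
    mask = (rng.random(clean.shape) > 0.3).astype(float)
    noisy = clean * mask  # masking noise: zero out ~30% of each input
    last_err = dae.train_step(clean, noisy)
    if step == 0:
        first_err = last_err

features = dae.encode(clean)  # training a 2nd DAE on these would start a stack
```

In the paper's setting the inputs would be noisy speech frames paired with their clean versions, and greedy layer-wise training of three such DAEs would give the 3L-SDAE whose hidden activations serve as the robust features.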
Pages: 124-127
Page count: 4