An unsupervised adaptation method for deep neural network-based large vocabulary continuous speech recognition

Cited by: 0
Authors
Xiao, Yeming [1]
Si, Yujing [1]
Xu, Ji [1]
Pan, Jielin [1]
Yan, Yonghong [1]
Affiliations
[1] Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing
Source
Journal of Information and Computational Science | 2014, Vol. 11, No. 14
Funding
National Natural Science Foundation of China
Keywords
Deep neural network; Regularized training; Speech recognition; Unsupervised adaptation;
DOI
10.12733/jics20104666
Abstract
Recently, the application of Deep Neural Networks (DNNs) to acoustic modeling has achieved great success on Large Vocabulary Continuous Speech Recognition (LVCSR) tasks. However, the performance of DNN-based Automatic Speech Recognition (ASR) systems still suffers greatly from the mismatch between training and testing conditions in real applications. Commonly used methods for DNN-based acoustic model adaptation have focused on supervised techniques such as affine transformation and regularized training; there has been little work on unsupervised adaptation methods for DNNs. In many cases, however, obtaining matched data with transcriptions is very expensive, while tremendous amounts of unlabelled data are available. In this paper, a novel unsupervised adaptation approach is proposed to mitigate the effects of the mismatch. Specifically, the original DNN is adapted on the acoustic observations of the unlabelled data, with the boosted posterior probabilities generated by the original DNN used as training targets. With around 1,000 hours of unlabelled data used for adaptation, experimental results on a Mandarin voice search recognition task demonstrate the effectiveness of the proposed technique: compared to the baseline, the adapted DNN achieves a 10% relative Character Error Rate (CER) reduction.
1548-7741 / Copyright © 2014 Binary Information Press
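The core idea described in the abstract — adapting the DNN on unlabelled acoustic data, with boosted posteriors from the original model serving as soft training targets — can be sketched as follows. The power-sharpening rule and the `beta` parameter below are illustrative assumptions; the paper's exact boosting scheme is not reproduced here.

```python
import numpy as np

def boost_posteriors(post, beta=2.0):
    """Sharpen frame-level posteriors for use as soft adaptation targets.

    NOTE: power-sharpening with renormalisation and the value of `beta`
    are assumptions for illustration, not the paper's exact boosting rule.
    """
    boosted = post ** beta
    return boosted / boosted.sum(axis=-1, keepdims=True)

def soft_cross_entropy(pred, target, eps=1e-12):
    """Cross-entropy of the adapted model's posteriors vs. soft targets."""
    return -np.mean(np.sum(target * np.log(pred + eps), axis=-1))

# Toy posteriors from the "original" DNN over 4 hypothetical senones,
# one row per acoustic frame of unlabelled data.
post = np.array([[0.50, 0.30, 0.15, 0.05],
                 [0.25, 0.25, 0.25, 0.25]])
targets = boost_posteriors(post, beta=2.0)
# Boosting makes confident frames more peaked (0.50 -> ~0.68) while a
# uniform, uninformative frame stays uniform after renormalisation.
```

In an actual adaptation run, `soft_cross_entropy` would be minimised with respect to the adapted network's parameters by backpropagation; only the target-generation step is shown here.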
Pages: 4889-4899 (10 pages)
References (19 entries)
[1]  
Bourlard H., Konig Y., Morgan N., A training algorithm for statistical sequence recognition with applications to transition-based speech recognition, Signal Processing Letters, 3, 7, pp. 203-205, (1996)
[2]  
Bourlard H., Morgan N., Wooters C., Renals S., CDNN: A context dependent neural network for continuous speech recognition, IEEE International Conference on Acoustics, Speech, and Signal Processing, 2, pp. 349-352, (1992)
[3]  
Dahl G.E., Yu D., Deng L., Acero A., Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition, IEEE Transactions on Audio, Speech, and Language Processing, 20, 1, pp. 30-42, (2012)
[4]  
Dupont S., Cheboub L., Fast speaker adaptation of artificial neural networks for automatic speech recognition, IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP'00 Proceedings, 3, pp. 1795-1798, (2000)
[5]  
Gauvain J.L., Lee C.-H., Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains, IEEE Transactions on Speech and Audio Processing, 2, 2, pp. 291-298, (1994)
[6]  
Gemello R., Mana F., Scanzio S., Laface P., de Mori R., Linear hidden transformations for adaptation of hybrid ANN/HMM models, Speech Communication, 49, 10, pp. 827-835, (2007)
[7]  
Hinton G., Deng L., Yu D., Dahl G.E., Mohamed A.-R., Jaitly N., Senior A., Vanhoucke V., Nguyen P., Sainath T.N., Et al., Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups, Signal Processing Magazine, 29, 6, pp. 82-97, (2012)
[8]  
Kingsbury B., Sainath T.N., Soltau H., Scalable minimum Bayes risk training of deep neural network acoustic models using distributed Hessian-free optimization, (2012)
[9]  
Leggetter C.J., Woodland P.C., Maximum likelihood linear regression for speaker adaptation of continuous density Hidden Markov Models, Computer Speech & Language, 9, 2, pp. 171-185, (1995)
[10]  
Li B., Sim K.C., Comparison of discriminative input and output transformations for speaker adaptation in the hybrid NN/HMM systems, pp. 526-529, (2010)