Distilling knowledge from ensembles of neural networks for speech recognition

Citations: 118
Authors
Chebotar, Yevgen [1 ]
Waters, Austin [2 ]
Affiliations
[1] Univ Southern Calif, Los Angeles, CA 90007 USA
[2] Google Inc, New York, NY USA
Source
17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES | 2016
Keywords
acoustic modeling; knowledge distillation; ensembles; deep neural networks; long short-term memory
DOI
10.21437/Interspeech.2016-1190
CLC number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
Speech recognition systems that combine multiple types of acoustic models have been shown to outperform single-model systems. However, such systems can be complex to implement and too resource-intensive to use in production. This paper describes how to use knowledge distillation to combine acoustic models in a way that has the best of many worlds: It improves recognition accuracy significantly, can be implemented with standard training tools, and requires no additional complexity during recognition. First, we identify a simple but particularly strong type of ensemble: a late combination of recurrent neural networks with different architectures and training objectives. To harness such an ensemble, we use a variant of standard cross-entropy training to distill it into a single model and then discriminatively fine-tune the result. An evaluation on 2,000-hour large vocabulary tasks in 5 languages shows that the distilled models provide up to 8.9% relative WER improvement over conventionally-trained baselines with an identical number of parameters.
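The distillation step the abstract describes can be illustrated with a short sketch. Below is a minimal, hypothetical Python/NumPy example of frame-level distillation: the ensemble's per-frame state posteriors are averaged into soft targets, and a student model is trained with cross-entropy against those targets instead of hard labels. The function names and the simple posterior-averaging combination are assumptions for illustration, not the paper's exact recipe (the paper uses a variant of standard cross-entropy training on a late combination of recurrent networks, followed by discriminative fine-tuning).

```python
import numpy as np

def distillation_targets(ensemble_posteriors):
    """Average the frame-level posteriors of the ensemble members
    to form soft targets (one common late-combination scheme;
    an assumption here, not necessarily the paper's weighting)."""
    return np.mean(ensemble_posteriors, axis=0)

def distillation_cross_entropy(student_logits, soft_targets):
    """Cross-entropy of the student's softmax output against the
    ensemble's soft targets, averaged over frames."""
    # Numerically stable log-softmax over the output-state dimension.
    z = student_logits - student_logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -np.mean(np.sum(soft_targets * log_probs, axis=-1))

# Toy example: 3 ensemble members, 4 frames, 5 output states.
rng = np.random.default_rng(0)
posteriors = rng.dirichlet(np.ones(5), size=(3, 4))  # (models, frames, states)
targets = distillation_targets(posteriors)           # (frames, states)
logits = rng.normal(size=(4, 5))                     # student model outputs
print(distillation_cross_entropy(logits, targets))
```

Because the loss is an ordinary cross-entropy with soft rather than one-hot targets, it can be minimized with the same standard training tools used for conventional acoustic-model training, which is the practical point the abstract makes.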
Pages: 3439-3443
Page count: 5