Distilling knowledge from ensembles of neural networks for speech recognition

Citations: 118
Authors
Chebotar, Yevgen [1 ]
Waters, Austin [2 ]
Affiliations
[1] Univ Southern Calif, Los Angeles, CA 90007 USA
[2] Google Inc, New York, NY USA
Source
17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES | 2016
Keywords
acoustic modeling; knowledge distillation; ensembles; deep neural networks; long short-term memory
DOI
10.21437/Interspeech.2016-1190
CLC number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
Speech recognition systems that combine multiple types of acoustic models have been shown to outperform single-model systems. However, such systems can be complex to implement and too resource-intensive to use in production. This paper describes how to use knowledge distillation to combine acoustic models in a way that has the best of many worlds: It improves recognition accuracy significantly, can be implemented with standard training tools, and requires no additional complexity during recognition. First, we identify a simple but particularly strong type of ensemble: a late combination of recurrent neural networks with different architectures and training objectives. To harness such an ensemble, we use a variant of standard cross-entropy training to distill it into a single model and then discriminatively fine-tune the result. An evaluation on 2,000-hour large vocabulary tasks in 5 languages shows that the distilled models provide up to 8.9% relative WER improvement over conventionally-trained baselines with an identical number of parameters.
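The distillation step the abstract describes can be illustrated with a short sketch. Below is a minimal, hypothetical Python/NumPy example of frame-level distillation: the ensemble's per-frame state posteriors are averaged into soft targets, and a student model is trained with cross-entropy against those targets instead of hard labels. The function names and the simple posterior-averaging combination are assumptions for illustration, not the paper's exact recipe (the paper uses a variant of standard cross-entropy training on a late combination of recurrent networks, followed by discriminative fine-tuning).

```python
import numpy as np

def distillation_targets(ensemble_posteriors):
    """Average the frame-level posteriors of the ensemble members
    to form soft targets (one common late-combination scheme;
    an assumption here, not necessarily the paper's weighting)."""
    return np.mean(ensemble_posteriors, axis=0)

def distillation_cross_entropy(student_logits, soft_targets):
    """Cross-entropy of the student's softmax output against the
    ensemble's soft targets, averaged over frames."""
    # Numerically stable log-softmax over the output-state dimension.
    z = student_logits - student_logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -np.mean(np.sum(soft_targets * log_probs, axis=-1))

# Toy example: 3 ensemble members, 4 frames, 5 output states.
rng = np.random.default_rng(0)
posteriors = rng.dirichlet(np.ones(5), size=(3, 4))  # (models, frames, states)
targets = distillation_targets(posteriors)           # (frames, states)
logits = rng.normal(size=(4, 5))                     # student model outputs
print(distillation_cross_entropy(logits, targets))
```

Because the loss is an ordinary cross-entropy with soft rather than one-hot targets, it can be minimized with the same standard training tools used for conventional acoustic-model training, which is the practical point the abstract makes.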
Pages: 3439-3443
Page count: 5