Purely sequence-trained neural networks for ASR based on lattice-free MMI

Cited by: 561
Authors
Povey, Daniel [1,2]
Peddinti, Vijayaditya [1 ]
Galvez, Daniel [3 ]
Ghahremani, Pegah [1 ]
Manohar, Vimal [1 ]
Na, Xingyu [4 ]
Wang, Yiming [1 ]
Khudanpur, Sanjeev [1,2]
Affiliations
[1] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD 21218 USA
[2] Johns Hopkins Univ, HLT CoE, Baltimore, MD 21218 USA
[3] Cornell Univ, Dept Comp Sci, Ithaca, NY 14853 USA
[4] Lele Innovat & Intelligence Technol Beijing Co, Beijing, Peoples R China
Source
17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES | 2016
Funding
U.S. National Science Foundation;
关键词
neural networks; sequence discriminative training;
DOI
10.21437/Interspeech.2016-595
Chinese Library Classification (CLC)
O42 [Acoustics];
Subject classification codes
070206; 082403;
Abstract
In this paper we describe a method to perform sequence-discriminative training of neural network acoustic models without the need for frame-level cross-entropy pre-training. We use the lattice-free version of the maximum mutual information (MMI) criterion: LF-MMI. To make its computation feasible we use a phone n-gram language model in place of the word language model. To further reduce its space and time complexity we compute the objective function using neural network outputs at one third the standard frame rate. These changes enable us to perform the computation for the forward-backward algorithm on GPUs. Further, the reduced output frame rate also provides a significant speed-up during decoding. We present results on 5 different LVCSR tasks with training data ranging from 100 to 2100 hours. Models trained with LF-MMI provide a relative word error rate reduction of ~11.5% over those trained with the cross-entropy objective function, and ~8% over those trained with cross-entropy and sMBR objective functions. A further relative reduction of ~2.5% can be obtained by fine-tuning these models with the word-lattice-based sMBR objective function.
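The record quotes only the abstract, so as a point of reference the MMI criterion it mentions is sketched below in its standard form; the notation (X_u for the acoustic observations of utterance u, W_u for its reference transcript, M_W for the composite HMM of word sequence W, P(W) for the language-model probability) is supplied here for illustration and is not taken from the paper itself. In LF-MMI the denominator sum is evaluated exactly by the forward-backward algorithm over a graph compiled from the phone n-gram language model, rather than approximated with word lattices, which is what makes the GPU computation described above feasible.

\mathcal{F}_{\mathrm{MMI}} \;=\; \sum_{u=1}^{U} \log \frac{p(\mathbf{X}_u \mid \mathbb{M}_{W_u})\, P(W_u)}{\sum_{W} p(\mathbf{X}_u \mid \mathbb{M}_{W})\, P(W)}

Both numerator and denominator are sums of path likelihoods through HMM graphs, so the gradient with respect to the network outputs reduces to the difference between numerator and denominator occupation probabilities computed by forward-backward.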
Pages: 2751-2755
Number of pages: 5