Minimum Bayes Risk Training of RNN-Transducer for End-to-End Speech Recognition

Cited by: 13
Authors
Weng, Chao [1]
Yu, Chengzhu [1]
Cui, Jia [1]
Zhang, Chunlei [1]
Yu, Dong [1]
Affiliations
[1] Tencent AI Lab, Bellevue, WA 98004 USA
Source
INTERSPEECH 2020 | 2020
Keywords
RNN-T; transformer; end-to-end speech recognition; sequential minimum Bayes risk training; MBR; shallow fusion; LVCSR;
DOI
10.21437/Interspeech.2020-1221
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline classification codes
100104; 100213
Abstract
In this work, we propose minimum Bayes risk (MBR) training of RNN-Transducer (RNN-T) for end-to-end speech recognition. Specifically, initialized with an RNN-T trained model, MBR training is conducted by minimizing the expected edit distance between the reference label sequence and on-the-fly generated N-best hypotheses. We also introduce a heuristic to incorporate an external neural network language model (NNLM) in RNN-T beam search decoding and explore MBR training with the external NNLM. Experimental results demonstrate that an MBR trained model outperforms an RNN-T trained model substantially, and that further improvements can be achieved if it is trained with an external NNLM. Our best MBR trained system achieves absolute character error rate (CER) reductions of 1.2% and 0.5% on read and spontaneous Mandarin speech, respectively, over a strong convolution and transformer based RNN-T baseline trained on approximately 21,000 hours of speech.
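The MBR objective named in the abstract, the expected edit distance between the reference and an N-best list, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `edit_distance` and `expected_edit_distance` are hypothetical, the hypothesis log-probabilities are assumed to be given, and renormalizing the probabilities over the N-best list is an assumption that the abstract does not state. The sketch only evaluates the loss value; it does not compute gradients or run beam search.

```python
import math


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def expected_edit_distance(ref, nbest):
    """Expected edit distance over an N-best list.

    nbest is a list of (hypothesis_tokens, log_prob) pairs.
    Assumption: probabilities are renormalized over the N-best list.
    """
    max_lp = max(lp for _, lp in nbest)  # subtract the max for numerical stability
    weights = [math.exp(lp - max_lp) for _, lp in nbest]
    total = sum(weights)
    return sum(w / total * edit_distance(ref, hyp)
               for (hyp, _), w in zip(nbest, weights))
```

With a character-level reference such as `list("abc")` and hypotheses `abc`, `abd`, and `ab` scored at log 0.5, log 0.25, and log 0.25, the hypotheses that each contain one error carry half of the renormalized probability mass, so the expected edit distance is 0.5.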
Pages: 966-970
Page count: 5