Distilling the Knowledge of BERT for Sequence-to-Sequence ASR

Cited by: 21
Authors
Futami, Hayato [1 ]
Inaguma, Hirofumi [1 ]
Ueno, Sei [1 ]
Mimura, Masato [1 ]
Sakai, Shinsuke [1 ]
Kawahara, Tatsuya [1 ]
Affiliations
[1] Kyoto Univ, Grad Sch Informat, Sakyo Ku, Kyoto, Japan
Source
INTERSPEECH 2020 | 2020
Keywords
speech recognition; sequence-to-sequence models; language model; BERT; knowledge distillation;
DOI
10.21437/Interspeech.2020-1179
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology];
Subject Classification Codes
100104 ; 100213 ;
Abstract
Attention-based sequence-to-sequence (seq2seq) models have achieved promising results in automatic speech recognition (ASR). However, because these models decode left to right, they have no access to right-hand context. We leverage both left and right context by applying BERT as an external language model to seq2seq ASR through knowledge distillation. In our proposed method, BERT generates soft labels that guide the training of the seq2seq ASR model. Furthermore, we leverage context beyond the current utterance as input to BERT. Experimental evaluations show that our method significantly improves ASR performance over the seq2seq baseline on the Corpus of Spontaneous Japanese (CSJ). Knowledge distillation from BERT outperforms distillation from a Transformer LM that sees only left context. We also show the effectiveness of leveraging context beyond the current utterance. Our method outperforms other LM application approaches such as n-best rescoring and shallow fusion, while adding no extra inference cost.
Pages: 3635-3639 (5 pages)
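The abstract describes training the seq2seq ASR model against soft labels produced by BERT, i.e. a knowledge-distillation objective that mixes the usual hard-label cross-entropy with a cross-entropy against the teacher's token distribution. The following is a minimal sketch of that kind of loss in plain numpy; the interpolation weight `alpha` and temperature `T` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Numerically stable temperature-scaled softmax over the vocabulary axis.
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      alpha=0.5, T=2.0):
    """Per-token distillation loss (hypothetical formulation):
    interpolate hard-label cross-entropy with cross-entropy against
    the teacher's (e.g. BERT's) softened output distribution."""
    soft_targets = softmax(teacher_logits, T)           # teacher soft labels
    log_p_soft = np.log(softmax(student_logits, T))     # student log-probs at T
    soft_loss = -(soft_targets * log_p_soft).sum(axis=-1).mean()
    log_p_hard = np.log(softmax(student_logits, 1.0))   # student log-probs at T=1
    idx = np.arange(len(hard_labels))
    hard_loss = -log_p_hard[idx, hard_labels].mean()
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```

With `alpha=0` this reduces to ordinary cross-entropy training; with `alpha>0` the student is additionally pulled toward the teacher's full distribution, which is how right-hand context seen by BERT can influence a left-to-right decoder at training time without any inference-time cost.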