Large-Scale Multilingual Speech Recognition with a Streaming End-to-End Model

Citations: 77
Authors
Kannan, Anjuli [1]
Datta, Arindrima [1]
Sainath, Tara N. [1]
Weinstein, Eugene [1]
Ramabhadran, Bhuvana [1]
Wu, Yonghui [1]
Bapna, Ankur [1]
Chen, Zhifeng [1]
Lee, Seungji [1]
Affiliations
[1] Google Inc, Mountain View, CA 94043 USA
Source
INTERSPEECH 2019 | 2019
Keywords
speech recognition; multilingual; RNN-T; residual adapter
DOI
10.21437/Interspeech.2019-2858
Chinese Library Classification (CLC) number
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject classification codes
100104; 100213
Abstract
Multilingual end-to-end (E2E) models have shown great promise in expansion of automatic speech recognition (ASR) coverage of the world's languages. They have shown improvement over monolingual systems, and have simplified training and serving by eliminating language-specific acoustic, pronunciation, and language models. This work presents an E2E multilingual system which is equipped to operate in low-latency interactive applications, as well as handle a key challenge of real world data: the imbalance in training data across languages. Using nine Indic languages, we compare a variety of techniques, and find that a combination of conditioning on a language vector and training language-specific adapter layers produces the best model. The resulting E2E multilingual model achieves a lower word error rate (WER) than both monolingual E2E models (eight of nine languages) and monolingual conventional systems (all nine languages).
Pages: 2130-2134
Page count: 5