Faster, Simpler and More Accurate Hybrid ASR Systems Using Wordpieces

Cited by: 5
Authors
Zhang, Frank [1 ]
Wang, Yongqiang [1 ]
Zhang, Xiaohui [1 ]
Liu, Chunxi [1 ]
Saraf, Yatharth [1 ]
Zweig, Geoffrey [1 ]
Affiliation
[1] Facebook AI, Menlo Pk, CA 94025 USA
Source
INTERSPEECH 2020 | 2020
Keywords
hybrid speech recognition; CTC; acoustic modeling; wordpiece; transformer; recurrent neural networks;
DOI
10.21437/Interspeech.2020-1995
CLC Classification
R36 [Pathology]; R76 [Otorhinolaryngology];
Subject Classification Codes
100104 ; 100213 ;
Abstract
In this work, we first show that on the widely used LibriSpeech benchmark, our transformer-based context-dependent connectionist temporal classification (CTC) system produces state-of-the-art results. We then show that using wordpieces as modeling units combined with CTC training, we can greatly simplify the engineering pipeline compared to conventional frame-based cross-entropy training by eliminating the GMM bootstrapping, decision-tree building, and forced-alignment steps, while still achieving a very competitive word error rate. Additionally, using wordpieces as modeling units can significantly improve runtime efficiency, since we can use a larger stride without losing accuracy. We further confirm these findings on two internal VideoASR datasets: German, which like English is a fusional language, and Turkish, which is an agglutinative language.
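The abstract's key simplification is that CTC training on wordpiece targets needs no frame-level forced alignment. A minimal sketch of that setup using PyTorch's built-in CTC loss is shown below; the vocabulary size, tensor shapes, and random "acoustic model output" are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

# Minimal sketch (not the paper's code) of CTC training with wordpiece
# targets. Sizes are illustrative; a real system would emit these
# log-probabilities from a transformer acoustic model.
vocab_size = 200             # hypothetical wordpiece inventory incl. blank
blank_id = 0
T, N, U = 50, 2, 12          # frames, batch size, target wordpieces

# Stand-in for per-frame acoustic-model logits over wordpiece units.
logits = torch.randn(T, N, vocab_size, requires_grad=True)
log_probs = logits.log_softmax(dim=-1)

# Wordpiece target ID sequences; CTC needs no frame-level alignment.
targets = torch.randint(1, vocab_size, (N, U))
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), U, dtype=torch.long)

ctc = nn.CTCLoss(blank=blank_id)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()              # CTC marginalizes over all alignments
```

Because the loss sums over all valid alignments internally, the GMM bootstrapping and forced-alignment stages of a conventional hybrid pipeline are unnecessary; only the wordpiece target sequence per utterance is required.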
Pages: 976-980 (5 pages)