Faster, Simpler and More Accurate Hybrid ASR Systems Using Wordpieces

Cited by: 5
Authors
Zhang, Frank [1 ]
Wang, Yongqiang [1 ]
Zhang, Xiaohui [1 ]
Liu, Chunxi [1 ]
Saraf, Yatharth [1 ]
Zweig, Geoffrey [1 ]
Affiliation
[1] Facebook AI, Menlo Pk, CA 94025 USA
Source
INTERSPEECH 2020 | 2020
Keywords
hybrid speech recognition; CTC; acoustic modeling; wordpiece; transformer; recurrent neural networks;
DOI
10.21437/Interspeech.2020-1995
CLC Classification
R36 [Pathology]; R76 [Otorhinolaryngology];
Subject Classification Codes
100104 ; 100213 ;
Abstract
In this work, we first show that on the widely used LibriSpeech benchmark, our transformer-based context-dependent connectionist temporal classification (CTC) system produces state-of-the-art results. We then show that using wordpieces as modeling units combined with CTC training, we can greatly simplify the engineering pipeline compared to conventional frame-based cross-entropy training by eliminating the GMM bootstrapping, decision-tree building, and forced-alignment steps, while still achieving a very competitive word error rate. Additionally, using wordpieces as modeling units can significantly improve runtime efficiency, since we can use a larger stride without losing accuracy. We further confirm these findings on two internal VideoASR datasets: German, which like English is a fusional language, and Turkish, which is an agglutinative language.
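The abstract's key simplification is that CTC training on wordpiece targets needs no frame-level forced alignment. A minimal sketch of that setup using PyTorch's built-in CTC loss is shown below; the vocabulary size, tensor shapes, and random "acoustic model output" are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

# Minimal sketch (not the paper's code) of CTC training with wordpiece
# targets. Sizes are illustrative; a real system would emit these
# log-probabilities from a transformer acoustic model.
vocab_size = 200             # hypothetical wordpiece inventory incl. blank
blank_id = 0
T, N, U = 50, 2, 12          # frames, batch size, target wordpieces

# Stand-in for per-frame acoustic-model logits over wordpiece units.
logits = torch.randn(T, N, vocab_size, requires_grad=True)
log_probs = logits.log_softmax(dim=-1)

# Wordpiece target ID sequences; CTC needs no frame-level alignment.
targets = torch.randint(1, vocab_size, (N, U))
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), U, dtype=torch.long)

ctc = nn.CTCLoss(blank=blank_id)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()              # CTC marginalizes over all alignments
```

Because the loss sums over all valid alignments internally, the GMM bootstrapping and forced-alignment stages of a conventional hybrid pipeline are unnecessary; only the wordpiece target sequence per utterance is required.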
Pages: 976-980 (5 pages)