Improving RNN-T for Domain Scaling Using Semi-Supervised Training with Neural TTS

Cited: 5
Authors
Deng, Yan [1 ]
Zhao, Rui [2 ]
Meng, Zhong [2 ]
Chen, Xie [2 ]
Liu, Bing [1 ]
Li, Jinyu [2 ]
Gong, Yifan [2 ]
He, Lei [1 ]
Affiliations
[1] Microsoft, Beijing, Peoples R China
[2] Microsoft, Redmond, WA USA
Source
INTERSPEECH 2021 | 2021
Keywords
RNN-T; customization; semi-supervised training; neural TTS; speech
DOI
10.21437/Interspeech.2021-1017
Abstract
The recurrent neural network transducer (RNN-T) has been shown to be comparable with conventional hybrid models for speech recognition. However, it still struggles in out-of-domain scenarios whose context or vocabulary differs from the training data. In this paper, we explore semi-supervised training that optimizes RNN-T jointly with neural text-to-speech (TTS) to better generalize to new domains using domain-specific text data. We apply the method to two tasks: one with out-of-domain context and the other with a significant number of out-of-vocabulary (OOV) words. The results show that the proposed method significantly improves recognition accuracy on both tasks, yielding 61.4% and 53.8% relative word error rate (WER) reductions, respectively, over a well-trained RNN-T baseline built on 65 thousand hours of training data. We further study the semi-supervised training methodology: 1) which modules of the RNN-T model to update; 2) the impact of using different neural TTS models; 3) the performance of text with varying relevance to the target domain. Finally, we compare several RNN-T customization methods and conclude that semi-supervised training with neural TTS is comparable and complementary to Internal Language Model Estimation (ILME) and biasing.
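The workflow the abstract describes — synthesizing speech from domain-specific text with a neural TTS model, then using the synthetic pairs to adapt selected RNN-T modules — can be sketched roughly as below. This is a minimal illustrative sketch, not the paper's implementation: `tts_synthesize`, the `RNNT` class, and the choice to freeze the encoder while updating the prediction and joint networks are all assumptions made for the example (the paper itself studies which modules to update).

```python
# Hedged sketch of semi-supervised RNN-T customization with neural TTS.
# All names here are hypothetical stand-ins, not the paper's actual code.

def tts_synthesize(text):
    """Stand-in for a neural TTS front end: text -> acoustic features.

    Here it just maps characters to dummy feature values.
    """
    return [float(ord(c)) for c in text]

class RNNT:
    """Toy stand-in for an RNN-T with its three canonical modules."""

    def __init__(self):
        # One dummy weight per module; a real model holds networks here.
        self.modules = {"encoder": [0.0], "prediction": [0.0], "joint": [0.0]}

    def fine_tune(self, features, text, update=("prediction", "joint")):
        # Update only the listed modules; keep the acoustic encoder frozen,
        # since TTS audio may mismatch real acoustic conditions (an
        # assumption for this sketch, one option the paper investigates).
        for name in update:
            self.modules[name] = [w + 0.01 for w in self.modules[name]]

# Domain-specific text (hypothetical) drives the adaptation loop.
domain_text = ["turn on the oven light", "preheat to 350 degrees"]
model = RNNT()
for utt in domain_text:
    model.fine_tune(tts_synthesize(utt), utt)
```

After the loop, the encoder weights are untouched while the text-side modules have been adapted, mirroring the module-selection question studied in the paper.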
Pages: 751-755
Page count: 5