Improving RNN-T for Domain Scaling Using Semi-Supervised Training with Neural TTS

Cited by: 7
Authors
Deng, Yan [1 ]
Zhao, Rui [2 ]
Meng, Zhong [2 ]
Chen, Xie [2 ]
Liu, Bing [1 ]
Li, Jinyu [2 ]
Gong, Yifan [2 ]
He, Lei [1 ]
Affiliations
[1] Microsoft, Beijing, Peoples R China
[2] Microsoft, Redmond, WA USA
Source
INTERSPEECH 2021 | 2021
Keywords
RNN-T; customization; semi-supervised training; neural TTS; SPEECH;
DOI
10.21437/Interspeech.2021-1017
CLC classification
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject classification codes
100104; 100213
Abstract
The recurrent neural network transducer (RNN-T) has been shown to be comparable with conventional hybrid models for speech recognition. However, it still struggles in out-of-domain scenarios whose context or vocabulary differs from the training data. In this paper, we explore semi-supervised training that optimizes the RNN-T jointly with a neural text-to-speech (TTS) model, so that domain-specific text data can be used to generalize to new domains. We apply the method to two tasks: one with out-of-domain context and the other with significant out-of-vocabulary (OOV) words. The results show that the proposed method significantly improves recognition accuracy on both tasks, yielding 61.4% and 53.8% relative word error rate (WER) reductions, respectively, over a well-trained RNN-T with 65 thousand hours of training data. We further study the semi-supervised training methodology: 1) which modules of the RNN-T model to update; 2) the impact of using different neural TTS models; and 3) performance when using text of varying relevance to the target domain. Finally, we compare several RNN-T customization methods and conclude that semi-supervised training with neural TTS is comparable and complementary to internal language model estimation (ILME) or biasing.
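The training loop sketched in the abstract can be illustrated as follows. This is a minimal conceptual sketch, not the paper's implementation: the class names, the stand-in loss, and the choice of updating only the prediction network are illustrative assumptions (the paper studies which modules to update; the prediction network acts as an internal label-side language model, making it a natural target for text-only domain data).

```python
# Hypothetical sketch of semi-supervised training with neural TTS:
# domain-specific text -> neural TTS -> synthetic speech -> RNN-T update.
# All names and the placeholder loss are illustrative, not the paper's code.

class NeuralTTS:
    def synthesize(self, text):
        # Stand-in for a neural TTS model: one fake acoustic frame per character.
        return [[float(ord(c) % 7)] for c in text]

class RNNT:
    def __init__(self):
        # Assumption for illustration: freeze the encoder, adapt the
        # prediction network on TTS-generated (audio, text) pairs.
        self.encoder_frozen = True
        self.updates = 0

    def loss(self, features, transcript):
        # Placeholder for the RNN-T transducer loss over the feature/label lattice.
        return abs(len(features) - len(transcript)) + 1.0

    def update_prediction_network(self, loss):
        # Gradient step on the selected module(s) only.
        self.updates += 1

def semi_supervised_step(tts, asr, domain_text):
    feats = tts.synthesize(domain_text)   # TTS supplies the missing speech side
    loss = asr.loss(feats, domain_text)   # paired (synthetic audio, text) supervision
    asr.update_prediction_network(loss)   # update only the chosen RNN-T modules
    return loss

tts, asr = NeuralTTS(), RNNT()
for sentence in ["play jazz music", "navigate to redmond"]:
    semi_supervised_step(tts, asr, sentence)
print(asr.updates)  # -> 2
```

In the real system the TTS output would be spectrogram or waveform features and the loss would be the actual RNN-T loss; the point of the sketch is the data flow that lets unpaired domain text drive acoustic-model updates.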
Pages: 751-755
Page count: 5