Improving Cross-Lingual Transfer Learning for End-to-End Speech Recognition with Speech Translation

Cited by: 7
Authors
Wang, Changhan [1 ]
Pino, Juan [1 ]
Gu, Jiatao [1 ]
Affiliations
[1] Facebook AI, Menlo Park, CA 94025 USA
Source
INTERSPEECH 2020 | 2020
Keywords
end-to-end speech recognition; cross-lingual transfer learning; speech translation; machine translation;
DOI
10.21437/Interspeech.2020-2955
Chinese Library Classification (CLC)
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject Classification Codes
100104; 100213
Abstract
Transfer learning from high-resource languages is known to be an efficient way to improve end-to-end automatic speech recognition (ASR) for low-resource languages. Pre-trained or jointly trained encoder-decoder models, however, do not share the language model (decoder) with the target language, which is likely to be inefficient for distant target languages. We introduce speech-to-text translation (ST) as an auxiliary task to incorporate additional knowledge of the target language and enable transfer from that target language. Specifically, we first translate high-resource ASR transcripts into a target low-resource language, with which an ST model is trained. Both ST and target ASR share the same attention-based encoder-decoder architecture and vocabulary. The former task then provides a fully pre-trained model for the latter, yielding up to 24.6% word error rate (WER) reduction over the baseline (direct transfer from high-resource ASR). We show that training ST with human translations is not necessary: ST trained with machine translation (MT) pseudo-labels brings consistent gains, and can even outperform ST trained with human labels when transferred to target ASR, using only 500K MT examples. Even with pseudo-labels from low-resource MT (200K examples), ST-enhanced transfer brings up to 8.9% WER reduction over direct transfer.
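The three-stage pipeline the abstract describes (MT pseudo-labeling, ST pretraining, ASR fine-tuning) can be sketched in miniature as follows. Everything here is a hypothetical illustration: the function names, the dictionary stand-ins for model weights, and the toy word-reversing "MT system" are not the authors' implementation, which uses full attention-based encoder-decoder networks.

```python
# Hedged sketch of ST-enhanced cross-lingual transfer for low-resource ASR.
# All names are illustrative assumptions, not the paper's actual code.

def translate_transcripts(transcripts, mt_model):
    """Stage 1: turn high-resource ASR transcripts into target-language
    pseudo-labels with an MT model (human translations are optional)."""
    return [mt_model(t) for t in transcripts]

def train_st(speech_utterances, target_texts):
    """Stage 2: train an ST model on (high-resource speech,
    target-language text) pairs. The dict stands in for real weights."""
    assert len(speech_utterances) == len(target_texts)
    return {"encoder": f"acoustic({len(speech_utterances)} utts)",
            "decoder": "target-language decoder"}

def finetune_asr(st_model, target_speech, target_transcripts):
    """Stage 3: initialize low-resource ASR from the full ST model
    (same architecture and vocabulary, so the encoder AND the decoder
    transfer), then fine-tune on the small target-language ASR set."""
    asr_model = dict(st_model)              # both halves are inherited
    asr_model["finetuned_utts"] = len(target_speech)
    return asr_model

# Toy end-to-end run with a dummy word-reversing "MT system".
mt = lambda text: " ".join(reversed(text.split()))
high_resource_speech = ["utt1.wav", "utt2.wav"]
high_resource_transcripts = ["hello world", "good morning"]

pseudo_labels = translate_transcripts(high_resource_transcripts, mt)
st = train_st(high_resource_speech, pseudo_labels)
asr = finetune_asr(st, ["tgt1.wav"], ["target transcript"])
```

The key design point is the last stage: because ST and target ASR share the decoder and vocabulary, pretraining transfers target-language modeling to the decoder, not just acoustic knowledge to the encoder, which is what direct transfer from high-resource ASR cannot provide.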
Pages: 4731-4735
Number of pages: 5