End-to-end Text-to-speech for Low-resource Languages by Cross-Lingual Transfer Learning

Cited by: 40
Authors
Chen, Yuan-Jui [1]
Tu, Tao [1]
Yeh, Cheng-chieh [1]
Lee, Hung-yi [1]
Affiliations
[1] Natl Taiwan Univ, Coll Elect Engn & Comp Sci, Taipei, Taiwan
Source
INTERSPEECH 2019 | 2019
Keywords
end-to-end; speech synthesis; transfer learning; cross-lingual; low-resource;
DOI
10.21437/Interspeech.2019-2730
Chinese Library Classification (CLC) number
R36 [Pathology]; R76 [Otorhinolaryngology];
Subject classification code
100104 ; 100213 ;
Abstract
End-to-end text-to-speech (TTS) has shown great success when trained on large quantities of paired text and speech data. However, collecting such data is laborious and remains difficult for at least 95% of the world's languages, which hinders the development of TTS for those languages. In this paper, we aim to build TTS systems for such low-resource (target) languages, where only very limited paired data are available. We show that such TTS systems can be effectively constructed by transferring knowledge from a high-resource (source) language. Since a model trained on the source language cannot be directly applied to the target language because of the input space mismatch, we propose a method to learn a mapping between source and target linguistic symbols. Benefiting from this learned mapping, pronunciation information can be preserved throughout the transfer procedure. Preliminary experiments show that only around 15 minutes of paired data are needed to obtain a relatively good TTS system. Furthermore, analytic studies demonstrate that the automatically discovered mapping correlates well with phonetic expertise.
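The idea of a mapping between source and target linguistic symbols can be illustrated with a minimal sketch. The abstract does not say how the mapping is parameterized or trained, so everything below is an assumption made for illustration: each target symbol is given a row of scores over the source symbols, the scores are normalized with a softmax, and the target symbol's embedding is the resulting weighted sum of the source embeddings. The names `softmax`, `map_target_embeddings`, `mapping_logits`, and `source_embeddings` are hypothetical and are not taken from the paper.

```python
import math


def softmax(row):
    """Turn a row of scores into weights that sum to 1."""
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def map_target_embeddings(mapping_logits, source_embeddings):
    """Build target-symbol embeddings as weighted sums of source embeddings.

    mapping_logits: one row per target symbol, one score per source symbol
        (assumed here to be learnable parameters; hypothetical).
    source_embeddings: one embedding vector per source symbol, e.g. taken
        from a TTS model pretrained on the source language (assumption).
    """
    weights = [softmax(row) for row in mapping_logits]
    dim = len(source_embeddings[0])
    mapped = [
        [sum(w * src[d] for w, src in zip(row, source_embeddings))
         for d in range(dim)]
        for row in weights
    ]
    return mapped, weights


# Toy data: three source symbols with 2-dimensional embeddings.
source_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Two target symbols: the first has no preference among source symbols,
# the second is scored almost entirely onto source symbol 0.
mapping_logits = [[0.0, 0.0, 0.0], [20.0, 0.0, 0.0]]

target_embeddings, weights = map_target_embeddings(
    mapping_logits, source_embeddings
)
```

In this toy setup the first target symbol receives the average of all source embeddings, while the second is effectively tied to a single source symbol, which is the kind of symbol-to-symbol correspondence that could be compared against phonetic knowledge.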
Pages: 2075-2079
Number of pages: 5