Cross-lingual, Multi-speaker Text-To-Speech Synthesis Using Neural Speaker Embedding

Cited by: 31
Authors
Chen, Mengnan [1]
Chen, Minchuan [2]
Liang, Shuang [2]
Ma, Jun [2]
Chen, Lei [1]
Wang, Shaojun [2]
Xiao, Jing [2]
Affiliations
[1] East China Normal Univ, Shanghai, Peoples R China
[2] Ping An Technol, Shenzhen, Guangdong, Peoples R China
Source
INTERSPEECH 2019 | 2019
Keywords
neural TTS; multi-speaker modeling; multilanguage; speaker embedding
DOI
10.21437/Interspeech.2019-1632
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline classification codes
100104; 100213
Abstract
Neural network-based models for text-to-speech (TTS) synthesis have made significant progress in recent years. In this paper, we present a cross-lingual, multi-speaker neural end-to-end TTS framework that can model speaker characteristics and synthesize speech in different languages. We implement the model by introducing a separately trained neural speaker embedding network, which can represent the latent structure of different speakers and language pronunciations. We train the speech synthesis network bilingually and demonstrate that it is possible to synthesize a Chinese speaker's English speech and vice versa. We explore different methods to fit a new speaker using only a few speech samples. The experimental results show that, with only several minutes of audio from a new speaker, the proposed model can synthesize speech bilingually and achieve decent naturalness and similarity for both languages.
Pages: 2105-2109
Page count: 5