Towards Natural Bilingual and Code-Switched Speech Synthesis Based on Mix of Monolingual Recordings and Cross-Lingual Voice Conversion

Cited by: 12
Authors
Zhao, Shengkui [1]
Nguyen, Trung Hieu [1]
Wang, Hao [1]
Ma, Bin [1]
Affiliations
[1] Alibaba Group, Machine Intelligence Technology, Hangzhou, People's Republic of China
Source
INTERSPEECH 2020 | 2020
Keywords
cross-lingual voice conversion; Phonetic PosteriorGrams (PPGs); Tacotron2; Transformer; FastSpeech; text-to-speech; bilingual; code-switching
DOI
10.21437/Interspeech.2020-1163
Chinese Library Classification (CLC) number
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline classification code
100104; 100213
Abstract
Recent state-of-the-art neural text-to-speech (TTS) synthesis models have dramatically improved the intelligibility and naturalness of speech generated from text. However, building a good bilingual or code-switched TTS system for a particular voice remains a challenge. The main reason is that it is not easy to obtain a bilingual corpus from a speaker who has native-level fluency in both languages. In this paper, we explore the use of Mandarin speech recordings from a Mandarin speaker and English speech recordings from an English speaker to build high-quality bilingual and code-switched TTS for both speakers. A Tacotron2-based cross-lingual voice conversion system is employed to generate the Mandarin speaker's English speech and the English speaker's Mandarin speech, both of which show good naturalness and speaker similarity. The resulting bilingual data are then augmented with code-switched utterances synthesized using a Transformer model. With these data, three neural TTS models (Tacotron2, Transformer, and FastSpeech) are applied to build bilingual and code-switched TTS. Subjective evaluation results show that all three systems can produce (near-)native-level speech in both languages for each of the speakers.
Pages: 2927-2931
Page count: 5