Transfer Learning for Low-Resource, Multi-Lingual, and Zero-Shot Multi-Speaker Text-to-Speech

Cited by: 0
Authors
Jeong, Myeonghun [1 ,2 ]
Kim, Minchan [2 ]
Choi, Byoung Jin [2 ]
Yoon, Jaesam [1 ]
Jang, Won [1 ]
Kim, Nam Soo [2 ]
Affiliations
[1] Kakao Enterprise, Seongnam 13494, South Korea
[2] Seoul Natl Univ, Inst New Media & Commun, Dept Elect & Comp Engn, Seoul 08826, South Korea
Keywords
Linguistics; Data models; Training; Acoustics; Transfer learning; Feature extraction; Training data; Low-resource TTS; multi-lingual TTS; zero-shot multi-speaker TTS; self-supervised speech representation
DOI
10.1109/TASLP.2024.3364085
Chinese Library Classification
O42 [Acoustics]
Discipline Classification Code(s)
070206; 082403
Abstract
Though neural text-to-speech (TTS) models show remarkable performance, they still require large <speech, text> paired datasets, which are expensive to collect. This heavy demand for paired data restricts TTS models to a small number of speakers and languages. To address this problem, we introduce a transfer learning framework for multi-lingual, zero-shot multi-speaker, and low-resource TTS. First, we pretrain our model in an unsupervised manner on a multi-lingual, multi-speaker speech-only dataset, leveraging self-supervised speech representations as intermediate linguistic representations. Given this pretrained linguistic information, we then fine-tune the TTS model with supervised learning on a small paired dataset. The pretrained linguistic representations extracted from the large-scale speech-only dataset facilitate phoneme-to-linguistic-feature matching, which provides good guidance for supervised learning with a limited amount of labeled data. We evaluate the proposed model on low-resource, multi-lingual, and zero-shot multi-speaker TTS tasks. The experimental results demonstrate that our method outperforms the baseline in naturalness, intelligibility, and speaker similarity.
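The two-stage recipe the abstract describes (unsupervised pretraining of intermediate linguistic units from speech-only data, then low-resource supervised phoneme-to-unit matching) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: all function names are hypothetical, and the toy value-bucketing stands in for a real self-supervised speech model (e.g. HuBERT or wav2vec 2.0) that would produce the discrete units in practice.

```python
# Hypothetical sketch of the two-stage transfer-learning pipeline from the
# abstract. Stage 1 builds a discrete "linguistic unit" inventory from
# speech-only frames (a stand-in for self-supervised representations);
# stage 2 fits a phoneme-to-unit mapping from a small paired dataset.

from collections import defaultdict


def pretrain_linguistic_units(speech_only_frames):
    """Stage 1 (unsupervised, speech-only data): discretize frames into
    unit ids. Here a crude rounding step plays the role of clustering
    self-supervised speech representations."""
    units = [round(frame, 1) for frame in speech_only_frames]
    vocab = sorted(set(units))
    return {unit: idx for idx, unit in enumerate(vocab)}


def fit_phoneme_to_unit(paired_data, unit_table):
    """Stage 2 (supervised, small <speech, text> paired set): for each
    phoneme, count which pretrained unit its frames land in and keep the
    most frequent one — a toy version of phoneme-to-linguistic-feature
    matching guided by the pretrained inventory."""
    counts = defaultdict(lambda: defaultdict(int))
    for phoneme, frame in paired_data:
        unit = unit_table[round(frame, 1)]
        counts[phoneme][unit] += 1
    return {p: max(c, key=c.get) for p, c in counts.items()}
```

The point of the structure is that the expensive inventory (`unit_table`) is learned without any labels; the labeled stage only has to solve the much smaller matching problem, which is why a limited paired set suffices.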
Pages: 1519-1530 (12 pages)