Transfer Learning for Low-Resource, Multi-Lingual, and Zero-Shot Multi-Speaker Text-to-Speech

Times Cited: 0
Authors
Jeong, Myeonghun [1 ,2 ]
Kim, Minchan [2 ]
Choi, Byoung Jin [2 ]
Yoon, Jaesam [1 ]
Jang, Won [1 ]
Kim, Nam Soo [2 ]
Affiliations
[1] Kakao Enterprise, Seongnam 13494, South Korea
[2] Seoul Natl Univ, Inst New Media & Commun, Dept Elect & Comp Engn, Seoul 08826, South Korea
Keywords
Linguistics; Data models; Training; Acoustics; Transfer learning; Feature extraction; Training data; Low resource TTS; multi-lingual TTS; zero-shot multi-speaker TTS; self-supervised speech representation;
DOI
10.1109/TASLP.2024.3364085
CLC Classification Number
O42 [Acoustics];
Subject Classification Codes
070206 ; 082403 ;
Abstract
Though neural text-to-speech (TTS) models show remarkable performance, they still require a large amount of paired <speech, text> data, which is expensive to collect. This heavy demand for paired data restricts TTS models to a small number of speakers and languages. To address this problem, we introduce a transfer learning framework for multi-lingual, zero-shot multi-speaker, and low-resource TTS. First, we pretrain our model in an unsupervised manner on a multi-lingual, multi-speaker speech-only dataset, leveraging self-supervised speech representations as intermediate linguistic representations. Given this pretrained linguistic information, we then train the TTS model in a supervised manner with a small amount of paired data. The pretrained linguistic representations, extracted from a large-scale speech-only dataset, facilitate phoneme-to-linguistic-feature matching, which provides good guidance for supervised learning with a limited amount of labeled data. We evaluate the proposed model on low-resource, multi-lingual, and zero-shot multi-speaker TTS tasks. The experimental results demonstrate that our method outperforms the baseline in terms of naturalness, intelligibility, and speaker similarity.
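The two-stage framework the abstract describes can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: random vectors stand in for self-supervised speech features (e.g., wav2vec 2.0 frames), a small k-means stands in for discovering discrete linguistic units from speech-only data, and a per-phoneme average of unit embeddings stands in for the learned phoneme-to-linguistic-feature matching. All variable names, dimensions, and the number of units/phonemes here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1 (unsupervised pretraining, sketched): derive a discrete inventory of
# "linguistic units" from speech-only data by clustering frame-level features.
# Random vectors stand in for self-supervised representations here.
speech_feats = rng.normal(size=(500, 16))  # 500 frames, 16-dim features

def kmeans(X, k, iters=20):
    # Plain k-means: used as a stand-in for unit discovery.
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids

codebook = kmeans(speech_feats, k=8)  # pretrained unit inventory (8 units)

# Stage 2 (supervised fine-tuning, sketched): a small paired subset provides
# phoneme labels; map each frame to its nearest pretrained unit.
paired_feats = speech_feats[:100]                 # tiny labeled subset
paired_phonemes = rng.integers(0, 4, size=100)    # 4 hypothetical phonemes
dists = np.linalg.norm(paired_feats[:, None] - codebook[None], axis=-1)
paired_units = dists.argmin(axis=1)

# Phoneme-to-linguistic-feature matching: average the pretrained unit
# embeddings observed for each phoneme (a stand-in for the learned mapping).
phoneme_table = np.stack([
    codebook[paired_units[paired_phonemes == p]].mean(axis=0)
    for p in range(4)
])
print(phoneme_table.shape)  # one pretrained linguistic vector per phoneme
```

The point of the sketch is the division of labor: the unit inventory is learned from abundant unlabeled speech, so the supervised stage only has to learn a mapping from phonemes into an already-structured linguistic space rather than from scratch.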
Pages: 1519-1530
Page count: 12