Transfer Learning for Low-Resource, Multi-Lingual, and Zero-Shot Multi-Speaker Text-to-Speech

Cited by: 2
Authors
Jeong, Myeonghun [1 ,2 ]
Kim, Minchan [2 ]
Choi, Byoung Jin [2 ]
Yoon, Jaesam [1 ]
Jang, Won [1 ]
Kim, Nam Soo [2 ]
Affiliations
[1] Kakao Enterprise, Seongnam 13494, South Korea
[2] Seoul Natl Univ, Inst New Media & Commun, Dept Elect & Comp Engn, Seoul 08826, South Korea
Keywords
Linguistics; Data models; Training; Acoustics; Transfer learning; Feature extraction; Training data; Low resource TTS; multi-lingual TTS; zero-shot multi-speaker TTS; self-supervised speech representation;
DOI
10.1109/TASLP.2024.3364085
Chinese Library Classification
O42 [Acoustics];
Discipline Codes
070206; 082403;
Abstract
Though neural text-to-speech (TTS) models show remarkable performance, they still require a large <speech, text> paired dataset, which is expensive to collect. This heavy demand for paired data restricts TTS models to a small number of speakers and languages. To address this problem, we introduce a transfer learning framework for multi-lingual, zero-shot multi-speaker, and low-resource TTS. First, we pretrain our model in an unsupervised manner on a multi-lingual, multi-speaker speech-only dataset, leveraging self-supervised speech representations as intermediate linguistic representations. Given this pretrained linguistic information, we then fine-tune the TTS model with supervised learning on a small paired dataset. The pretrained linguistic representations, extracted from the large-scale speech-only dataset, facilitate phoneme-to-linguistic-feature matching, which provides good guidance for supervised learning with a limited amount of labeled data. We evaluate the proposed model on low-resource, multi-lingual, and zero-shot multi-speaker TTS tasks. The experimental results demonstrate that our method outperforms the baseline in terms of naturalness, intelligibility, and speaker similarity.
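The two-stage pipeline described in the abstract can be sketched very roughly as follows. This is a toy illustration only, not the authors' model: the synthetic features, the k-means quantization into discrete "linguistic units", and the phoneme-to-unit lookup are all stand-in assumptions for the real self-supervised pretraining and supervised matching stages.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Stage 1: unsupervised pretraining on speech-only data (sketched) ---
# Stand-in for self-supervised speech features (e.g., frame-level wav2vec-style
# vectors); a real system would extract these from a large speech-only corpus.
ssl_features = rng.normal(size=(500, 16))

def kmeans(x, k, iters=20):
    """Toy k-means: maps continuous SSL frames to discrete linguistic units."""
    centroids = x[rng.choice(len(x), k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(x[:, None] - centroids[None], axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = x[labels == j].mean(axis=0)
    return centroids

num_units = 8
codebook = kmeans(ssl_features, num_units)

def to_units(feats):
    """Quantize frames to their nearest pretrained linguistic unit."""
    return np.linalg.norm(feats[:, None] - codebook[None], axis=-1).argmin(axis=1)

# --- Stage 2: supervised phoneme-to-unit matching on a small paired set ---
# Hypothetical tiny paired dataset: a few phonemes, each aligned to SSL frames.
paired = {ph: rng.normal(size=(20, 16)) + off
          for ph, off in [("a", 0.0), ("i", 3.0), ("u", -3.0)]}

# Learn a simple phoneme -> most-frequent-unit lookup from the small labeled set;
# the pretrained units do the heavy lifting, so little paired data is needed.
phoneme_to_unit = {ph: int(np.bincount(to_units(f)).argmax())
                   for ph, f in paired.items()}
print(phoneme_to_unit)
```

The design point the sketch mirrors is that the expensive step (learning an inventory of speech units) uses unlabeled audio only, while the scarce paired data is spent on the much easier mapping from phonemes to those units.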
Pages: 1519-1530
Page count: 12