Transfer Learning for Low-Resource, Multi-Lingual, and Zero-Shot Multi-Speaker Text-to-Speech

Cited by: 2
Authors
Jeong, Myeonghun [1 ,2 ]
Kim, Minchan [2 ]
Choi, Byoung Jin [2 ]
Yoon, Jaesam [1 ]
Jang, Won [1 ]
Kim, Nam Soo [2 ]
Affiliations
[1] Kakao Enterprise, Seongnam 13494, South Korea
[2] Seoul Natl Univ, Inst New Media & Commun, Dept Elect & Comp Engn, Seoul 08826, South Korea
Keywords
Linguistics; Data models; Training; Acoustics; Transfer learning; Feature extraction; Training data; Low resource TTS; multi-lingual TTS; zero-shot multi-speaker TTS; self-supervised speech representation;
DOI
10.1109/TASLP.2024.3364085
Chinese Library Classification
O42 [Acoustics];
Discipline Codes
070206; 082403;
Abstract
Though neural text-to-speech (TTS) models show remarkable performance, they still require a large <speech, text> paired dataset, which is expensive to collect. This heavy demand for paired data restricts TTS models to a small number of speakers and languages. To address this problem, we introduce a transfer learning framework for multi-lingual, zero-shot multi-speaker, and low-resource TTS. First, we pretrain our model in an unsupervised manner on a multi-lingual, multi-speaker speech-only dataset, leveraging self-supervised speech representations as intermediate linguistic representations. Given this pretrained linguistic information, we then fine-tune the TTS model with supervised learning on a small paired dataset. The pretrained linguistic representations, extracted from the large-scale speech-only dataset, facilitate phoneme-to-linguistic-feature matching, which provides good guidance for supervised learning with a limited amount of labeled data. We evaluate the proposed model on low-resource, multi-lingual, and zero-shot multi-speaker TTS tasks. The experimental results demonstrate that our method outperforms the baseline in terms of naturalness, intelligibility, and speaker similarity.
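The two-stage pipeline described in the abstract can be sketched very roughly as follows. This is a toy illustration only, not the authors' model: the synthetic features, the k-means quantization into discrete "linguistic units", and the phoneme-to-unit lookup are all stand-in assumptions for the real self-supervised pretraining and supervised matching stages.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Stage 1: unsupervised pretraining on speech-only data (sketched) ---
# Stand-in for self-supervised speech features (e.g., frame-level wav2vec-style
# vectors); a real system would extract these from a large speech-only corpus.
ssl_features = rng.normal(size=(500, 16))

def kmeans(x, k, iters=20):
    """Toy k-means: maps continuous SSL frames to discrete linguistic units."""
    centroids = x[rng.choice(len(x), k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(x[:, None] - centroids[None], axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = x[labels == j].mean(axis=0)
    return centroids

num_units = 8
codebook = kmeans(ssl_features, num_units)

def to_units(feats):
    """Quantize frames to their nearest pretrained linguistic unit."""
    return np.linalg.norm(feats[:, None] - codebook[None], axis=-1).argmin(axis=1)

# --- Stage 2: supervised phoneme-to-unit matching on a small paired set ---
# Hypothetical tiny paired dataset: a few phonemes, each aligned to SSL frames.
paired = {ph: rng.normal(size=(20, 16)) + off
          for ph, off in [("a", 0.0), ("i", 3.0), ("u", -3.0)]}

# Learn a simple phoneme -> most-frequent-unit lookup from the small labeled set;
# the pretrained units do the heavy lifting, so little paired data is needed.
phoneme_to_unit = {ph: int(np.bincount(to_units(f)).argmax())
                   for ph, f in paired.items()}
print(phoneme_to_unit)
```

The design point the sketch mirrors is that the expensive step (learning an inventory of speech units) uses unlabeled audio only, while the scarce paired data is spent on the much easier mapping from phonemes to those units.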
Pages: 1519-1530
Page count: 12