Although neural text-to-speech (TTS) models show remarkable performance, they still require a large amount of paired <speech, text> data, which is expensive to collect. This heavy demand for paired data restricts TTS models to a small number of speakers and languages. To address this problem, we introduce a transfer learning framework for multi-lingual, zero-shot multi-speaker, and low-resource TTS. First, we pretrain our model in an unsupervised manner on a multi-lingual, multi-speaker speech-only dataset, leveraging self-supervised speech representations as intermediate linguistic representations. Given this pretrained linguistic information, we then train the TTS model in a supervised manner with a small amount of paired data. The pretrained linguistic representations, extracted from the large-scale speech-only dataset, facilitate phoneme-to-linguistic-feature matching, which provides good guidance for supervised learning with limited labeled data. We evaluate our proposed model on low-resource, multi-lingual, and zero-shot multi-speaker TTS tasks. The experimental results demonstrate that our method outperforms the baseline in terms of naturalness, intelligibility, and speaker similarity.
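To make the two-stage recipe concrete, the following is a minimal PyTorch sketch, not the authors' implementation: the module names (`UnitToMelDecoder`, `PhonemeToUnitEncoder`), the feature dimensions, the L1 losses, and the assumption of frame-aligned phonemes are all illustrative placeholders for whatever self-supervised features and architecture the actual system uses.

```python
# Hypothetical sketch of the two-stage transfer learning recipe described
# above. All module names, shapes, and losses are illustrative assumptions.
import torch
import torch.nn as nn

class UnitToMelDecoder(nn.Module):
    """Stage 1: maps self-supervised (SSL) speech features to mel frames.
    Trainable on speech-only data, since no text labels are needed."""
    def __init__(self, ssl_dim=768, mel_dim=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(ssl_dim, 256), nn.ReLU(), nn.Linear(256, mel_dim)
        )

    def forward(self, ssl_feats):           # (batch, frames, ssl_dim)
        return self.net(ssl_feats)          # (batch, frames, mel_dim)

class PhonemeToUnitEncoder(nn.Module):
    """Stage 2: maps phonemes into the same SSL feature space, trained on
    a small paired <speech, text> set (phoneme-to-feature matching)."""
    def __init__(self, n_phonemes=100, ssl_dim=768):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, 256)
        self.proj = nn.Linear(256, ssl_dim)

    def forward(self, phoneme_ids):         # (batch, frames), pre-aligned
        return self.proj(self.embed(phoneme_ids))

# Toy stand-ins for the real corpora (shapes only, random data).
ssl_feats = torch.randn(4, 120, 768)            # SSL features, speech-only corpus
mel_target = torch.randn(4, 120, 80)            # mel frames, same utterances
phoneme_ids = torch.randint(0, 100, (4, 120))   # frame-aligned phonemes, paired set

# Stage 1: unsupervised pretraining on the speech-only dataset.
decoder = UnitToMelDecoder()
opt1 = torch.optim.Adam(decoder.parameters(), lr=1e-4)
loss1 = nn.functional.l1_loss(decoder(ssl_feats), mel_target)
opt1.zero_grad(); loss1.backward(); opt1.step()

# Stage 2: supervised learning on the small paired dataset, matching
# phoneme encodings to the pretrained linguistic features.
encoder = PhonemeToUnitEncoder()
opt2 = torch.optim.Adam(encoder.parameters(), lr=1e-4)
loss2 = nn.functional.l1_loss(encoder(phoneme_ids), ssl_feats)
opt2.zero_grad(); loss2.backward(); opt2.step()
```

Under these assumptions, inference would chain the two parts: text is converted to phonemes, the stage-2 encoder predicts linguistic features, and the stage-1 decoder renders them as mel-spectrograms, so only the small phoneme encoder depends on labeled data.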