YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for Everyone

Cited by: 0

Authors:

Casanova, Edresson [1,2]

Weber, Julian [2,3]

Shulby, Christopher [4]

Candido Junior, Arnaldo [5]

Goelge, Eren [2]

Ponti, Moacir Antonelli [1,6]

Affiliations:

[1] Univ Sao Paulo, Inst Ciencias Matemat & Comp, Sao Paulo, Brazil

[2] Coqui, Heidelberg, Germany

[3] Sopra Banking Software, Paris, France

[4] Defined Ai, Seattle, WA USA

[5] Univ Tecnol Fed Parana, Curitiba, Parana, Brazil

[6] Mercado Livre, Sao Paulo, Brazil

Source:

INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 162 | 2022

Funding:

São Paulo Research Foundation (FAPESP), Brazil;

Keywords:

cross-lingual zero-shot multi-speaker TTS; text-to-speech; cross-lingual zero-shot voice conversion; speaker adaptation;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

YourTTS brings the power of a multilingual approach to the task of zero-shot multi-speaker TTS. Our method builds upon the VITS model and adds several novel modifications for zero-shot multi-speaker and multilingual training. We achieved state-of-the-art (SOTA) results in zero-shot multi-speaker TTS and results comparable to SOTA in zero-shot voice conversion on the VCTK dataset. Additionally, our approach achieves promising results in a target language with a single-speaker dataset, opening possibilities for zero-shot multi-speaker TTS and zero-shot voice conversion systems in low-resource languages. Finally, it is possible to fine-tune the YourTTS model with less than 1 minute of speech and achieve state-of-the-art results in voice similarity with reasonable quality. This is important to allow synthesis for speakers whose voice or recording characteristics differ greatly from those seen during training.

Pages: 12