YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for Everyone

Cited by: 0
Authors
Casanova, Edresson [1, 2]
Weber, Julian [2, 3]
Shulby, Christopher [4]
Candido Junior, Arnaldo [5]
Goelge, Eren [2]
Ponti, Moacir Antonelli [1, 6]
Affiliations
[1] Univ Sao Paulo, Inst Ciencias Matemat & Comp, Sao Paulo, Brazil
[2] Coqui, Heidelberg, Germany
[3] Sopra Banking Software, Paris, France
[4] Defined Ai, Seattle, WA USA
[5] Univ Tecnol Fed Parana, Curitiba, Parana, Brazil
[6] Mercado Livre, Sao Paulo, Brazil
Source
INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 162 | 2022
Funding
São Paulo Research Foundation (FAPESP), Brazil;
Keywords
cross-lingual zero-shot multi-speaker TTS; text-to-speech; cross-lingual zero-shot voice conversion; speaker adaptation;
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
YourTTS brings the power of a multilingual approach to the task of zero-shot multi-speaker TTS. Our method builds upon the VITS model and adds several novel modifications for zero-shot multi-speaker and multilingual training. We achieved state-of-the-art (SOTA) results in zero-shot multi-speaker TTS and results comparable to SOTA in zero-shot voice conversion on the VCTK dataset. Additionally, our approach achieves promising results in a target language with a single-speaker dataset, opening possibilities for zero-shot multi-speaker TTS and zero-shot voice conversion systems in low-resource languages. Finally, it is possible to fine-tune the YourTTS model with less than 1 minute of speech and achieve state-of-the-art results in voice similarity with reasonable quality. This is important to allow synthesis for speakers whose voice or recording characteristics differ greatly from those seen during training.
Pages: 12
Related papers
50 in total
  • [1] Enhancing Zero-Shot Multi-Speaker TTS with Negated Speaker Representations
    Jeon, Yejin
    Kim, Yunsu
    Lee, Gary Geunbae
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 16, 2024, : 18336 - 18344
  • [2] Zero-shot multi-speaker accent TTS with limited accent data
    Zhang, Mingyang
    Zhou, Yi
    Wu, Zhizheng
    Li, Haizhou
    2023 ASIA PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE, APSIPA ASC, 2023, : 1931 - 1936
  • [3] Normalization Driven Zero-shot Multi-Speaker Speech Synthesis
    Kumar, Neeraj
    Goel, Srishti
    Narang, Ankur
    Lall, Brejesh
    INTERSPEECH 2021, 2021, : 1354 - 1358
  • [4] Zero-Shot Unseen Speaker Anonymization via Voice Conversion
    Chang, Hyung-Pil
    Yoo, In-Chul
    Jeong, Changhyeon
    Yook, Dongsuk
    IEEE ACCESS, 2022, 10 : 130190 - 130199
  • [5] SCALING NVIDIA'S MULTI-SPEAKER MULTI-LINGUAL TTS SYSTEMS WITH ZERO-SHOT TTS TO INDIC LANGUAGES
    Arora, Akshit
    Badlani, Rohan
    Kim, Sungwon
    Valle, Rafael
    Catanzaro, Bryan
    2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING WORKSHOPS, ICASSPW 2024, 2024, : 115 - 116
  • [6] Zero-Shot Normalization Driven Multi-Speaker Text to Speech Synthesis
    Kumar, Neeraj
    Narang, Ankur
    Lall, Brejesh
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30 : 1679 - 1693
  • [7] Towards Improved Zero-shot Voice Conversion with Conditional DSVAE
    Lian, Jiachen
    Zhang, Chunlei
    Anumanchipalli, Gopala Krishna
    Yu, Dong
    INTERSPEECH 2022, 2022, : 2598 - 2602
  • [8] Towards Zero-Shot Multi-Speaker Multi-Accent Text-to-Speech Synthesis
    Zhang, Mingyang
    Zhou, Xuehao
    Wu, Zhizheng
    Li, Haizhou
    IEEE SIGNAL PROCESSING LETTERS, 2023, 30 : 947 - 951
  • [9] Zero-Shot Voice Conditioning for Denoising Diffusion TTS Models
    Levkovitch, Alon
    Nachmani, Eliya
    Wolf, Lior
    INTERSPEECH 2022, 2022, : 2983 - 2987
  • [10] ZERO-SHOT VOICE CONVERSION WITH ADJUSTED SPEAKER EMBEDDINGS AND SIMPLE ACOUSTIC FEATURES
    Tan, Zhiyuan
    Wei, Jianguo
    Xu, Junhai
    He, Yuqing
    Lu, Wenhuan
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 5964 - 5968