EfficientTTS 2: Variational End-to-End Text-to-Speech Synthesis and Voice Conversion

Cited by: 2
Authors
Miao, Chenfeng [1]
Zhu, Qingying [1]
Chen, Minchuan [1]
Ma, Jun [1]
Wang, Shaojun [1]
Xiao, Jing [1]
Affiliations
[1] Ping Technol, Shanghai 200120, Peoples R China
Keywords
Training; Vectors; Computational modeling; Task analysis; Acoustics; Couplings; Computer architecture; Text-to-speech; speech synthesis; voice conversion; differentiable aligner; VAE; hierarchical-VAE; end-to-end;
DOI
10.1109/TASLP.2024.3369528
Chinese Library Classification number
O42 [Acoustics];
Discipline classification codes
070206; 082403;
Abstract
Recently, the field of Text-to-Speech (TTS) has been dominated by one-stage text-to-waveform models which have significantly improved speech quality compared to two-stage models. In this work, we propose EfficientTTS 2 (EFTS2), a one-stage high-quality end-to-end TTS framework that is fully differentiable and highly efficient. Our method adopts an adversarial training process, with a differentiable aligner and a hierarchical-VAE-based waveform generator. These design choices free the model from the use of external aligners, invertible structures, and complex training procedures as most previous TTS works have. Moreover, we extend EFTS2 to the voice conversion (VC) task and propose EFTS2-VC, an end-to-end VC model that allows high-quality speech-to-speech conversion. Experimental results suggest that the two proposed models achieve better or at least comparable speech quality compared to baseline models, while also providing faster inference speeds and smaller model sizes.
Pages: 1650-1661
Page count: 12