EfficientTTS 2: Variational End-to-End Text-to-Speech Synthesis and Voice Conversion

Cited by: 2
Authors
Miao, Chenfeng [1 ]
Zhu, Qingying [1 ]
Chen, Minchuan [1 ]
Ma, Jun [1 ]
Wang, Shaojun [1 ]
Xiao, Jing [1 ]
Affiliations
[1] Ping An Technology, Shanghai 200120, People's Republic of China
Keywords
Training; Vectors; Computational modeling; Task analysis; Acoustics; Couplings; Computer architecture; Text-to-speech; speech synthesis; voice conversion; differentiable aligner; VAE; hierarchical-VAE; end-to-end;
DOI
10.1109/TASLP.2024.3369528
Chinese Library Classification
O42 [Acoustics]
Discipline Classification Codes
070206; 082403
Abstract
Recently, the field of Text-to-Speech (TTS) has been dominated by one-stage text-to-waveform models, which have significantly improved speech quality compared to two-stage models. In this work, we propose EfficientTTS 2 (EFTS2), a one-stage, high-quality, end-to-end TTS framework that is fully differentiable and highly efficient. Our method adopts an adversarial training process, with a differentiable aligner and a hierarchical-VAE-based waveform generator. These design choices free the model from the external aligners, invertible structures, and complex training procedures that most previous TTS works rely on. Moreover, we extend EFTS2 to the voice conversion (VC) task and propose EFTS2-VC, an end-to-end VC model that allows high-quality speech-to-speech conversion. Experimental results suggest that the two proposed models achieve speech quality better than or at least comparable to that of the baseline models, while also providing faster inference speeds and smaller model sizes.
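The abstract's central idea, a differentiable aligner that removes the need for external alignment tools, can be illustrated with a small sketch. Below is a minimal, hypothetical PyTorch version assuming a duration-predictor formulation: predicted per-token durations define expected frame positions, and a Gaussian kernel turns them into a soft, monotonic alignment matrix that stays differentiable end-to-end. The class name SoftAligner, the duration head, the Gaussian kernel, and all hyperparameters are illustrative assumptions, not the EFTS2 architecture described in the paper.

```python
import torch
import torch.nn as nn

class SoftAligner(nn.Module):
    """Duration-based soft monotonic aligner (illustrative sketch,
    not the EFTS2 implementation).

    Predicted per-token durations define each token's expected center
    frame; a Gaussian kernel around those centers yields a soft
    alignment matrix, so the mapping from text states to frame-level
    features is differentiable and needs no external aligner.
    """
    def __init__(self, d_model: int = 192, sigma: float = 1.0):
        super().__init__()
        # hypothetical duration head: one positive scalar per token
        self.duration = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1)
        )
        self.sigma = sigma

    def forward(self, text_h: torch.Tensor):
        # text_h: (B, T_text, d) encoder states
        dur = torch.exp(self.duration(text_h)).squeeze(-1)   # (B, T_text)
        ends = torch.cumsum(dur, dim=-1)                     # running end frame
        centers = ends - 0.5 * dur                           # token center frames
        n_frames = int(ends[:, -1].max().ceil().item())
        frames = torch.arange(n_frames, device=text_h.device,
                              dtype=text_h.dtype)
        # squared distance of every frame index to every token center
        dist = (frames.view(1, -1, 1) - centers.unsqueeze(1)) ** 2
        # each frame's weights over tokens sum to one (soft alignment)
        attn = torch.softmax(-dist / (2 * self.sigma ** 2), dim=-1)
        # (B, n_frames, T_text) @ (B, T_text, d) -> (B, n_frames, d)
        return attn @ text_h, dur

# toy usage: frame_feats would feed the waveform decoder; dur would
# receive a duration loss during training
enc_out = torch.randn(2, 13, 192)
frame_feats, dur = SoftAligner()(enc_out)
print(frame_feats.shape, dur.shape)
```

In a one-stage system of the kind the abstract describes, the resulting frame-level features would condition the hierarchical-VAE waveform generator directly, with the adversarial loss applied to the generated waveform, so neither an external aligner nor an invertible flow structure is required.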
Pages: 1650-1661
Page count: 12
Related Papers
50 records in total (entries [41]-[50] shown)
  • [41] End-to-End Video-to-Speech Synthesis Using Generative Adversarial Networks
    Mira, Rodrigo
    Vougioukas, Konstantinos
    Ma, Pingchuan
    Petridis, Stavros
    Schuller, Bjoern W.
    Pantic, Maja
    IEEE TRANSACTIONS ON CYBERNETICS, 2023, 53 (06) : 3454 - 3466
  • [42] Cross-Speaker Emotion Disentangling and Transfer for End-to-End Speech Synthesis
    Li, Tao
    Wang, Xinsheng
    Xie, Qicong
    Wang, Zhichao
    Xie, Lei
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30 : 1448 - 1460
  • [43] Multitask Training with Text Data for End-to-End Speech Recognition
    Wang, Peidong
    Sainath, Tara N.
    Weiss, Ron J.
    INTERSPEECH 2021, 2021, : 2566 - 2570
  • [44] NVC-NET: END-TO-END ADVERSARIAL VOICE CONVERSION
    Nguyen, Bac
    Cardinaux, Fabien
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 7012 - 7016
  • [45] END-TO-END VOICE CONVERSION VIA CROSS-MODAL KNOWLEDGE DISTILLATION FOR DYSARTHRIC SPEECH RECONSTRUCTION
    Wang, Disong
    Yu, Jianwei
    Wu, Xixin
    Liu, Songxiang
Sun, Lifa
    Liu, Xunying
    Meng, Helen
2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2020, : 7744 - 7748
  • [46] Corpus generation for voice command in smart home and the effect of speech synthesis on End-to-End SLU
    Desot, Thierry
    Portet, Francois
    Vacher, Michel
    PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020), 2020, : 6395 - 6404
  • [47] EXTENDING PARROTRON: AN END-TO-END, SPEECH CONVERSION AND SPEECH RECOGNITION MODEL FOR ATYPICAL SPEECH
    Doshi, Rohan
    Chen, Youzheng
    Jiang, Liyang
    Zhang, Xia
    Biadsy, Fadi
    Ramabhadran, Bhuvana
    Chu, Fang
    Rosenberg, Andrew
    Moreno, Pedro J.
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 6988 - 6992
  • [48] Memory Attention: Robust Alignment Using Gating Mechanism for End-to-End Speech Synthesis
    Lee, Joun Yeop
    Cheon, Sung Jun
    Choi, Byoung Jin
    Kim, Nam Soo
    IEEE SIGNAL PROCESSING LETTERS, 2020, 27 : 2004 - 2008
  • [49] Vocoder-free End-to-End Voice Conversion with Transformer Network
    Kim, June-Woo
    Jung, Ho-Young
    Lee, Minho
2020 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2020
  • [50] Unsupervised End-to-End Learning of Discrete Linguistic Units for Voice Conversion
    Liu, Andy T.
    Hsu, Po-chun
    Lee, Hung-yi
    INTERSPEECH 2019, 2019, : 1108 - 1112