EfficientTTS 2: Variational End-to-End Text-to-Speech Synthesis and Voice Conversion

Cited by: 2
Authors
Miao, Chenfeng [1]
Zhu, Qingying [1]
Chen, Minchuan [1]
Ma, Jun [1]
Wang, Shaojun [1]
Xiao, Jing [1]
Affiliations
[1] Ping Technol, Shanghai 200120, People's Republic of China
Keywords
Training; Vectors; Computational modeling; Task analysis; Acoustics; Couplings; Computer architecture; Text-to-speech; speech synthesis; voice conversion; differentiable aligner; VAE; hierarchical-VAE; end-to-end;
DOI
10.1109/TASLP.2024.3369528
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Recently, the field of Text-to-Speech (TTS) has been dominated by one-stage text-to-waveform models which have significantly improved speech quality compared to two-stage models. In this work, we propose EfficientTTS 2 (EFTS2), a one-stage high-quality end-to-end TTS framework that is fully differentiable and highly efficient. Our method adopts an adversarial training process, with a differentiable aligner and a hierarchical-VAE-based waveform generator. These design choices free the model from the use of external aligners, invertible structures, and complex training procedures as most previous TTS works have. Moreover, we extend EFTS2 to the voice conversion (VC) task and propose EFTS2-VC, an end-to-end VC model that allows high-quality speech-to-speech conversion. Experimental results suggest that the two proposed models achieve better or at least comparable speech quality compared to baseline models, while also providing faster inference speeds and smaller model sizes.
Pages: 1650-1661
Page count: 12
Related papers
50 in total
  • [1] NaturalSpeech: End-to-End Text-to-Speech Synthesis With Human-Level Quality
    Tan, Xu
    Chen, Jiawei
    Liu, Haohe
    Cong, Jian
    Zhang, Chen
    Liu, Yanqing
    Wang, Xi
    Leng, Yichong
    Yi, Yuanhao
    He, Lei
    Zhao, Sheng
    Qin, Tao
    Soong, Frank
    Liu, Tie-Yan
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (06) : 4234 - 4245
  • [2] Myanmar Text-to-Speech Synthesis Using End-to-End Model
    Qin, Qinglai
    Yang, Jian
    Li, Peiying
    2020 4TH INTERNATIONAL CONFERENCE ON NATURAL LANGUAGE PROCESSING AND INFORMATION RETRIEVAL, NLPIR 2020, 2020, : 6 - 11
  • [3] End-to-End Mongolian Text-to-Speech System
    Li, Jingdong
    Zhang, Hui
    Liu, Rui
    Zhang, Xueliang
    Bao, Feilong
    2018 11TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2018, : 483 - 487
  • [4] EXPLORING END-TO-END NEURAL TEXT-TO-SPEECH SYNTHESIS FOR ROMANIAN
    Dumitrache, Marius
    Rebedea, Traian
    PROCEEDINGS OF THE 15TH INTERNATIONAL CONFERENCE LINGUISTIC RESOURCES AND TOOLS FOR NATURAL LANGUAGE PROCESSING, 2020, : 93 - 102
  • [5] End-to-End Text-To-Speech synthesis for under resourced South African languages
    Nthite, Thapelo
    Tsoeu, Mohohlo
    2020 INTERNATIONAL SAUPEC/ROBMECH/PRASA CONFERENCE, 2020, : 684 - 689
  • [6] Effective Emotion Transplantation in an End-to-End Text-to-Speech System
    Joo, Young-Sun
    Bae, Hanbin
    Kim, Young-Ik
    Cho, Hoon-Young
    Kang, Hong-Goo
    IEEE ACCESS, 2020, 8 : 161713 - 161719
  • [7] End-to-End Thai Text-to-Speech with Linguistic Unit
    Wisetpaitoon, Kontawat
    Singkul, Sattaya
    Sakdejayont, Theerat
    Chalothorn, Tawunrat
    PROCEEDINGS OF THE 4TH ANNUAL ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2024, 2024, : 951 - 959
  • [8] End-to-end text-to-speech synthesis with unaligned multiple language units based on attention
    Aso, Masashi
    Takamichi, Shinnosuke
    Saruwatari, Hiroshi
    INTERSPEECH 2020, 2020, : 4009 - 4013
  • [9] On the Training and Testing Data Preparation for End-to-End Text-to-Speech Application
    Duc Chung Tran
    Khan, M. K. A. Ahamed
    Sridevi, S.
    2020 11TH IEEE CONTROL AND SYSTEM GRADUATE RESEARCH COLLOQUIUM (ICSGRC), 2020, : 73 - 75
  • [10] Multi speaker text-to-speech synthesis using generalized end-to-end loss function
    Nazir, Owais
    Malik, Aruna
    Singh, Samayveer
    Pathan, Al-Sakib Khan
    MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (24) : 64205 - 64222