EfficientTTS 2: Variational End-to-End Text-to-Speech Synthesis and Voice Conversion

Cited by: 2

Authors:

Miao, Chenfeng [1]

Zhu, Qingying [1]

Chen, Minchuan [1]

Ma, Jun [1]

Wang, Shaojun [1]

Xiao, Jing [1]

Affiliations:

[1] Ping Technol, Shanghai 200120, Peoples R China

Source:

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2024 / Vol. 32

Keywords:

Training; Vectors; Computational modeling; Task analysis; Acoustics; Couplings; Computer architecture; Text-to-speech; speech synthesis; voice conversion; differentiable aligner; VAE; hierarchical-VAE; end-to-end;

DOI:

10.1109/TASLP.2024.3369528

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

Recently, the field of Text-to-Speech (TTS) has been dominated by one-stage text-to-waveform models, which have significantly improved speech quality compared to two-stage models. In this work, we propose EfficientTTS 2 (EFTS2), a one-stage, high-quality, end-to-end TTS framework that is fully differentiable and highly efficient. Our method adopts an adversarial training process, with a differentiable aligner and a hierarchical-VAE-based waveform generator. These design choices free the model from the external aligners, invertible structures, and complex training procedures that most previous TTS works rely on. Moreover, we extend EFTS2 to the voice conversion (VC) task and propose EFTS2-VC, an end-to-end VC model that allows high-quality speech-to-speech conversion. Experimental results suggest that the two proposed models achieve better or at least comparable speech quality compared to baseline models, while also providing faster inference speeds and smaller model sizes.

Pages: 1650-1661

Page count: 12

50 entries in total

[21] Reinforce-Aligner: Reinforcement Alignment Search for Robust End-to-End Text-to-Speech
Chung, Hyunseung
Lee, Sang-Hoon
Lee, Seong-Whan
INTERSPEECH 2021, 2021, : 3635 - 3639
[22] END-TO-END TEXT-TO-SPEECH USING LATENT DURATION BASED ON VQ-VAE
Yasuda, Yusuke
Wang, Xin
Yamagishi, Junichi
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 5694 - 5698
[23] ESPNET-TTS: UNIFIED, REPRODUCIBLE, AND INTEGRATABLE OPEN SOURCE END-TO-END TEXT-TO-SPEECH TOOLKIT
Hayashi, Tomoki
Yamamoto, Ryuichi
Inoue, Katsuki
Yoshimura, Takenori
Watanabe, Shinji
Toda, Tomoki
Takeda, Kazuya
Zhang, Yu
Tan, Xu
2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7654 - 7658
[24] Phonetic and Prosodic Information Estimation from Texts for Genuine Japanese End-to-End Text-to-Speech
Kakegawa, Naoto
Hara, Sunao
Abe, Masanobu
Ijima, Yusuke
INTERSPEECH 2021, 2021, : 126 - 130
[25] Boosting subjective quality of Arabic text-to-speech (TTS) using end-to-end deep architecture
Fahmy, Fady K.
Abbas, Hazem M.
Khalil, Mahmoud, I
INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY, 2022, 25 (01) : 79 - 88
[26] Boosting subjective quality of Arabic text-to-speech (TTS) using end-to-end deep architecture
Fady K. Fahmy
Hazem M. Abbas
Mahmoud I. Khalil
International Journal of Speech Technology, 2022, 25 : 79 - 88
[27] Towards End-to-End Synthetic Speech Detection
Hua, Guang
Teoh, Andrew Beng Jin
Zhang, Haijian
IEEE SIGNAL PROCESSING LETTERS, 2021, 28 (28) : 1265 - 1269
[28] EMOTIONAL VOICE CONVERSION USING MULTITASK LEARNING WITH TEXT-TO-SPEECH
Kim, Tae-Ho
Cho, Sungjae
Choi, Shinkook
Park, Sejik
Lee, Soo-Young
2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7774 - 7778
[29] NIX-TTS: LIGHTWEIGHT AND END-TO-END TEXT-TO-SPEECH VIA MODULE-WISE DISTILLATION
Chevi, Rendi
Prasojo, Radityo Eko
Aji, Alham Fikri
Tjandra, Andros
Sakti, Sakriani
2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022, : 970 - 976
[30] CampNet: Context-Aware Mask Prediction for End-to-End Text-Based Speech Editing
Wang, Tao
Yi, Jiangyan
Fu, Ruibo
Tao, Jianhua
Wen, Zhengqi
IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30 : 2241 - 2254