EfficientTTS 2: Variational End-to-End Text-to-Speech Synthesis and Voice Conversion

Cited by: 2

Authors:

Miao, Chenfeng [1]

Zhu, Qingying [1]

Chen, Minchuan [1]

Ma, Jun [1]

Wang, Shaojun [1]

Xiao, Jing [1]

Affiliations:

[1] Ping Technol, Shanghai 200120, People's Republic of China

Source:

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2024 / Vol. 32

Keywords:

Training; Vectors; Computational modeling; Task analysis; Acoustics; Couplings; Computer architecture; Text-to-speech; speech synthesis; voice conversion; differentiable aligner; VAE; hierarchical-VAE; end-to-end;

DOI:

10.1109/TASLP.2024.3369528

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification code:

070206 ; 082403 ;

Abstract:

Recently, the field of Text-to-Speech (TTS) has been dominated by one-stage text-to-waveform models, which have significantly improved speech quality compared to two-stage models. In this work, we propose EfficientTTS 2 (EFTS2), a one-stage, high-quality, end-to-end TTS framework that is fully differentiable and highly efficient. Our method adopts an adversarial training process, with a differentiable aligner and a hierarchical-VAE-based waveform generator. These design choices free the model from the external aligners, invertible structures, and complex training procedures that most previous TTS works rely on. Moreover, we extend EFTS2 to the voice conversion (VC) task and propose EFTS2-VC, an end-to-end VC model that allows high-quality speech-to-speech conversion. Experimental results suggest that the two proposed models achieve better or at least comparable speech quality compared to baseline models, while also providing faster inference speeds and smaller model sizes.


Pages: 1650-1661

Page count: 12

