EfficientTTS 2: Variational End-to-End Text-to-Speech Synthesis and Voice Conversion

Cited by: 2

Authors:

Miao, Chenfeng [1]

Zhu, Qingying [1]

Chen, Minchuan [1]

Ma, Jun [1]

Wang, Shaojun [1]

Xiao, Jing [1]

Affiliation:

[1] Ping Technol, Shanghai 200120, Peoples R China

Source:

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2024 / Vol. 32

Keywords:

Training; Vectors; Computational modeling; Task analysis; Acoustics; Couplings; Computer architecture; Text-to-speech; speech synthesis; voice conversion; differentiable aligner; VAE; hierarchical-VAE; end-to-end;

DOI:

10.1109/TASLP.2024.3369528

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification code:

070206 ; 082403 ;

Abstract:

Recently, the field of Text-to-Speech (TTS) has been dominated by one-stage text-to-waveform models, which have significantly improved speech quality compared to two-stage models. In this work, we propose EfficientTTS 2 (EFTS2), a one-stage high-quality end-to-end TTS framework that is fully differentiable and highly efficient. Our method adopts an adversarial training process, with a differentiable aligner and a hierarchical-VAE-based waveform generator. These design choices free the model from the external aligners, invertible structures, and complex training procedures that most previous TTS works rely on. Moreover, we extend EFTS2 to the voice conversion (VC) task and propose EFTS2-VC, an end-to-end VC model that allows high-quality speech-to-speech conversion. Experimental results suggest that the two proposed models achieve better or at least comparable speech quality compared to baseline models, while also providing faster inference speeds and smaller model sizes.

Pages: 1650-1661

Page count: 12

