EfficientTTS 2: Variational End-to-End Text-to-Speech Synthesis and Voice Conversion

Cited by: 2
Authors
Miao, Chenfeng [1 ]
Zhu, Qingying [1 ]
Chen, Minchuan [1 ]
Ma, Jun [1 ]
Wang, Shaojun [1 ]
Xiao, Jing [1 ]
Affiliations
[1] Ping An Technology, Shanghai 200120, People's Republic of China
Keywords
Training; Vectors; Computational modeling; Task analysis; Acoustics; Couplings; Computer architecture; Text-to-speech; speech synthesis; voice conversion; differentiable aligner; VAE; hierarchical-VAE; end-to-end;
DOI
10.1109/TASLP.2024.3369528
Chinese Library Classification
O42 [Acoustics]
Discipline Classification Codes
070206; 082403
Abstract
Recently, the field of Text-to-Speech (TTS) has been dominated by one-stage text-to-waveform models, which have significantly improved speech quality compared to two-stage models. In this work, we propose EfficientTTS 2 (EFTS2), a one-stage, high-quality, end-to-end TTS framework that is fully differentiable and highly efficient. Our method adopts an adversarial training process, with a differentiable aligner and a hierarchical-VAE-based waveform generator. These design choices free the model from the external aligners, invertible structures, and complex training procedures that most previous TTS works rely on. Moreover, we extend EFTS2 to the voice conversion (VC) task and propose EFTS2-VC, an end-to-end VC model that allows high-quality speech-to-speech conversion. Experimental results suggest that the two proposed models achieve speech quality better than or at least comparable to that of the baseline models, while also providing faster inference speeds and smaller model sizes.
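The abstract's central idea, a differentiable aligner that removes the need for external alignment tools, can be illustrated with a small sketch. Below is a minimal, hypothetical PyTorch version assuming a duration-predictor formulation: predicted per-token durations define expected frame positions, and a Gaussian kernel turns them into a soft, monotonic alignment matrix that stays differentiable end-to-end. The class name SoftAligner, the duration head, the Gaussian kernel, and all hyperparameters are illustrative assumptions, not the EFTS2 architecture described in the paper.

```python
import torch
import torch.nn as nn

class SoftAligner(nn.Module):
    """Duration-based soft monotonic aligner (illustrative sketch,
    not the EFTS2 implementation).

    Predicted per-token durations define each token's expected center
    frame; a Gaussian kernel around those centers yields a soft
    alignment matrix, so the mapping from text states to frame-level
    features is differentiable and needs no external aligner.
    """
    def __init__(self, d_model: int = 192, sigma: float = 1.0):
        super().__init__()
        # hypothetical duration head: one positive scalar per token
        self.duration = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1)
        )
        self.sigma = sigma

    def forward(self, text_h: torch.Tensor):
        # text_h: (B, T_text, d) encoder states
        dur = torch.exp(self.duration(text_h)).squeeze(-1)   # (B, T_text)
        ends = torch.cumsum(dur, dim=-1)                     # running end frame
        centers = ends - 0.5 * dur                           # token center frames
        n_frames = int(ends[:, -1].max().ceil().item())
        frames = torch.arange(n_frames, device=text_h.device,
                              dtype=text_h.dtype)
        # squared distance of every frame index to every token center
        dist = (frames.view(1, -1, 1) - centers.unsqueeze(1)) ** 2
        # each frame's weights over tokens sum to one (soft alignment)
        attn = torch.softmax(-dist / (2 * self.sigma ** 2), dim=-1)
        # (B, n_frames, T_text) @ (B, T_text, d) -> (B, n_frames, d)
        return attn @ text_h, dur

# toy usage: frame_feats would feed the waveform decoder; dur would
# receive a duration loss during training
enc_out = torch.randn(2, 13, 192)
frame_feats, dur = SoftAligner()(enc_out)
print(frame_feats.shape, dur.shape)
```

In a one-stage system of the kind the abstract describes, the resulting frame-level features would condition the hierarchical-VAE waveform generator directly, with the adversarial loss applied to the generated waveform, so neither an external aligner nor an invertible flow structure is required.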
Pages: 1650-1661
Page count: 12
Related Papers
50 records in total (entries [41]-[50] shown)
  • [41] End-to-End Video-to-Speech Synthesis Using Generative Adversarial Networks
    Mira, Rodrigo
    Vougioukas, Konstantinos
    Ma, Pingchuan
    Petridis, Stavros
    Schuller, Bjoern W.
    Pantic, Maja
    IEEE TRANSACTIONS ON CYBERNETICS, 2023, 53 (06) : 3454 - 3466
  • [42] Cross-Speaker Emotion Disentangling and Transfer for End-to-End Speech Synthesis
    Li, Tao
    Wang, Xinsheng
    Xie, Qicong
    Wang, Zhichao
    Xie, Lei
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30 : 1448 - 1460
  • [43] Multitask Training with Text Data for End-to-End Speech Recognition
    Wang, Peidong
    Sainath, Tara N.
    Weiss, Ron J.
    INTERSPEECH 2021, 2021, : 2566 - 2570
  • [44] NVC-NET: END-TO-END ADVERSARIAL VOICE CONVERSION
    Nguyen, Bac
    Cardinaux, Fabien
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 7012 - 7016
  • [45] END-TO-END VOICE CONVERSION VIA CROSS-MODAL KNOWLEDGE DISTILLATION FOR DYSARTHRIC SPEECH RECONSTRUCTION
    Wang, Disong
    Yu, Jianwei
    Wu, Xixin
    Liu, Songxiang
Sun, Lifa
    Liu, Xunying
    Meng, Helen
2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2020, : 7744 - 7748
  • [46] Corpus generation for voice command in smart home and the effect of speech synthesis on End-to-End SLU
    Desot, Thierry
    Portet, Francois
    Vacher, Michel
    PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020), 2020, : 6395 - 6404
  • [47] EXTENDING PARROTRON: AN END-TO-END, SPEECH CONVERSION AND SPEECH RECOGNITION MODEL FOR ATYPICAL SPEECH
    Doshi, Rohan
    Chen, Youzheng
    Jiang, Liyang
    Zhang, Xia
    Biadsy, Fadi
    Ramabhadran, Bhuvana
    Chu, Fang
    Rosenberg, Andrew
    Moreno, Pedro J.
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 6988 - 6992
  • [48] Memory Attention: Robust Alignment Using Gating Mechanism for End-to-End Speech Synthesis
    Lee, Joun Yeop
    Cheon, Sung Jun
    Choi, Byoung Jin
    Kim, Nam Soo
    IEEE SIGNAL PROCESSING LETTERS, 2020, 27 : 2004 - 2008
  • [49] Vocoder-free End-to-End Voice Conversion with Transformer Network
    Kim, June-Woo
    Jung, Ho-Young
    Lee, Minho
2020 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2020
  • [50] Unsupervised End-to-End Learning of Discrete Linguistic Units for Voice Conversion
    Liu, Andy T.
    Hsu, Po-chun
    Lee, Hung-yi
    INTERSPEECH 2019, 2019, : 1108 - 1112