CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency Model

Cited by: 9
Authors
Ye, Zhen [1 ]
Xue, Wei [1 ]
Tan, Xu [2 ]
Chen, Jie [3 ]
Liu, Qifeng [1 ]
Guo, Yike [1 ]
Affiliations
[1] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
[2] Microsoft Res Asia, Beijing, Peoples R China
[3] Hong Kong Baptist Univ, Hong Kong, Peoples R China
Source
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023 | 2023
Keywords
Text-to-speech; Singing Voice Synthesis; Diffusion Model; Consistency Model;
D O I
10.1145/3581783.3612061
CLC Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Denoising diffusion probabilistic models (DDPMs) have shown promising performance for speech synthesis. However, a large number of iterative steps are required to achieve high sample quality, which restricts inference speed. Maintaining sample quality while increasing sampling speed has become a challenging task. In this paper, we propose a consistency-model-based speech synthesis method, CoMoSpeech, which achieves speech synthesis through a single diffusion sampling step while maintaining high audio quality. A consistency constraint is applied to distill a consistency model from a well-designed diffusion-based teacher model, which ultimately yields superior performance in the distilled CoMoSpeech. Our experiments show that by generating audio recordings in a single sampling step, CoMoSpeech achieves an inference speed more than 150 times faster than real-time on a single NVIDIA A100 GPU, comparable to FastSpeech2, making diffusion-sampling-based speech synthesis truly practical. Meanwhile, objective and subjective evaluations on text-to-speech and singing voice synthesis show that the proposed teacher models yield the best audio quality, and the one-step-sampling-based CoMoSpeech achieves the best inference speed with better or comparable audio quality to conventional multi-step diffusion model baselines. Audio samples and code are available at https://comospeech.github.io/.
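The abstract's core idea is that a consistency function maps any point on a diffusion trajectory directly to its clean endpoint, replacing many ODE solver steps with one evaluation. The paper's actual model is not reproduced here, but the idea can be illustrated with a minimal 1-D Gaussian toy sketch where the consistency function is analytic; all names and the setting below are illustrative assumptions, not CoMoSpeech's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10.0  # maximum noise level

def score(x, t):
    # Exact score of the marginal N(0, 1 + t^2) for x_t = x_0 + t*eps, x_0 ~ N(0, 1).
    return -x / (1.0 + t * t)

def multistep_sample(n, steps=200):
    # Conventional diffusion sampling: integrate the probability-flow ODE
    # dx/dt = -t * score(x, t) from t = T down to t ~ 0 with Euler steps.
    x = np.sqrt(1.0 + T * T) * rng.standard_normal(n)
    ts = np.linspace(T, 1e-3, steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        dx_dt = -t_cur * score(x, t_cur)
        x = x + dx_dt * (t_next - t_cur)
    return x

def consistency_sample(n):
    # One-step sampling: the consistency function sends a noisy x_T straight
    # to its ODE trajectory endpoint. In this toy Gaussian case the trajectory
    # is x(t) = x(0) * sqrt(1 + t^2), so f(x, t) = x / sqrt(1 + t^2).
    x_T = np.sqrt(1.0 + T * T) * rng.standard_normal(n)
    return x_T / np.sqrt(1.0 + T * T)

multi = multistep_sample(100_000)    # 200 network-equivalent evaluations
single = consistency_sample(100_000) # 1 evaluation
print(float(np.std(multi)), float(np.std(single)))
```

Both samplers recover the data distribution N(0, 1), but the consistency function does so in one step; in CoMoSpeech that function is a learned network distilled from the diffusion teacher rather than this closed form, which is where the reported 150x-faster-than-real-time inference comes from.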
Pages: 1831-1839
Page count: 9