CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency Model

Cited by: 9
Authors
Ye, Zhen [1 ]
Xue, Wei [1 ]
Tan, Xu [2 ]
Chen, Jie [3 ]
Liu, Qifeng [1 ]
Guo, Yike [1 ]
Affiliations
[1] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
[2] Microsoft Res Asia, Beijing, Peoples R China
[3] Hong Kong Baptist Univ, Hong Kong, Peoples R China
Source
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023 | 2023
Keywords
Text-to-speech; Singing Voice Synthesis; Diffusion Model; Consistency Model;
D O I
10.1145/3581783.3612061
CLC Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Denoising diffusion probabilistic models (DDPMs) have shown promising performance for speech synthesis. However, a large number of iterative steps are required to achieve high sample quality, which restricts inference speed. Maintaining sample quality while increasing sampling speed has become a challenging task. In this paper, we propose a consistency-model-based speech synthesis method, CoMoSpeech, which achieves speech synthesis through a single diffusion sampling step while maintaining high audio quality. A consistency constraint is applied to distill a consistency model from a well-designed diffusion-based teacher model, which ultimately yields superior performance in the distilled CoMoSpeech. Our experiments show that by generating audio recordings in a single sampling step, CoMoSpeech achieves an inference speed more than 150 times faster than real-time on a single NVIDIA A100 GPU, comparable to FastSpeech2, making diffusion-sampling-based speech synthesis truly practical. Meanwhile, objective and subjective evaluations on text-to-speech and singing voice synthesis show that the proposed teacher models yield the best audio quality, and the one-step-sampling-based CoMoSpeech achieves the best inference speed with better or comparable audio quality to conventional multi-step diffusion model baselines. Audio samples and code are available at https://comospeech.github.io/.
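The abstract's core idea is that a consistency function maps any point on a diffusion trajectory directly to its clean endpoint, replacing many ODE solver steps with one evaluation. The paper's actual model is not reproduced here, but the idea can be illustrated with a minimal 1-D Gaussian toy sketch where the consistency function is analytic; all names and the setting below are illustrative assumptions, not CoMoSpeech's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10.0  # maximum noise level

def score(x, t):
    # Exact score of the marginal N(0, 1 + t^2) for x_t = x_0 + t*eps, x_0 ~ N(0, 1).
    return -x / (1.0 + t * t)

def multistep_sample(n, steps=200):
    # Conventional diffusion sampling: integrate the probability-flow ODE
    # dx/dt = -t * score(x, t) from t = T down to t ~ 0 with Euler steps.
    x = np.sqrt(1.0 + T * T) * rng.standard_normal(n)
    ts = np.linspace(T, 1e-3, steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        dx_dt = -t_cur * score(x, t_cur)
        x = x + dx_dt * (t_next - t_cur)
    return x

def consistency_sample(n):
    # One-step sampling: the consistency function sends a noisy x_T straight
    # to its ODE trajectory endpoint. In this toy Gaussian case the trajectory
    # is x(t) = x(0) * sqrt(1 + t^2), so f(x, t) = x / sqrt(1 + t^2).
    x_T = np.sqrt(1.0 + T * T) * rng.standard_normal(n)
    return x_T / np.sqrt(1.0 + T * T)

multi = multistep_sample(100_000)    # 200 network-equivalent evaluations
single = consistency_sample(100_000) # 1 evaluation
print(float(np.std(multi)), float(np.std(single)))
```

Both samplers recover the data distribution N(0, 1), but the consistency function does so in one step; in CoMoSpeech that function is a learned network distilled from the diffusion teacher rather than this closed form, which is where the reported 150x-faster-than-real-time inference comes from.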
Pages: 1831-1839
Page count: 9