DCTTS: DISCRETE DIFFUSION MODEL WITH CONTRASTIVE LEARNING FOR TEXT-TO-SPEECH GENERATION

Cited by: 1

Authors:

Wu, Zhichao [1]

Li, Qiulin [1]

Liu, Sixing [1]

Yang, Qun [1]

Affiliations:

[1] Nanjing Univ Aeronaut & Astronaut, Nanjing, Peoples R China

Source:

2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2024) | 2024

Keywords:

Text to speech; Discrete diffusion model; Contrastive learning; RTF; MOS;

DOI:

10.1109/ICASSP48485.2024.10447661

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification code:

070206 ; 082403 ;

Abstract:

In the text-to-speech (TTS) task, the latent diffusion model has excellent fidelity and generalization, but its expensive resource consumption and slow inference speed have always been a challenge. To address this issue, this paper proposes the Discrete Diffusion Model with Contrastive Learning for Text-to-Speech Generation (DCTTS). Specifically, we employ a straightforward and effective text encoder, compress the raw data into a discrete space using a VQ model, and then train the diffusion model on the discrete space. To minimize the number of diffusion steps needed to synthesize high-quality speech, we use a contrastive learning loss throughout the diffusion model training phase. The experimental results demonstrate that the proposed approach achieves outstanding speech synthesis quality and sampling speed while significantly reducing the resource consumption of the diffusion model. The synthesized samples are available at https://github.com/lawtherWu/DCTTS
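
As a rough illustration of what "compressing the raw data into a discrete space" can mean, the sketch below maps continuous feature frames to the index of their nearest codebook vector. It is a toy, not the paper's VQ model: the function name `quantize`, the hand-written `codebook`, and the two-dimensional `frames` are assumptions made for illustration, whereas a real VQ model learns its codebook and encoder from data.

```python
from typing import List, Sequence


def quantize(frames: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Return, for each frame, the index of the closest codebook vector."""

    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        # Squared Euclidean distance between two equal-length vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))
        for frame in frames
    ]


# Hypothetical 3-entry codebook and three 2-D feature frames (toy values).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [-0.8, 0.7]]

tokens = quantize(frames, codebook)
print(tokens)  # one discrete token index per frame
```

A discrete diffusion model of the kind the abstract describes would then operate on token sequences like `tokens` rather than on the continuous frames.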


Pages: 11336-11340

Page count: 5
