Comparing normalizing flows and diffusion models for prosody and acoustic modelling in text-to-speech

Cited by: 0

Authors:

Zhang, Guangyan [1,2,5]

Merritt, Thomas [1,5]

Ribeiro, Manuel Sam [1,5]

Tura-Vecino, Biel [1,5]

Yanagisawa, Kayoko [1,5]

Pokora, Kamil [1,5]

Ezzerg, Abdelhamid [1,5]

Cygert, Sebastian [1,3,5]

Abbas, Ammar [1,5]

Bilinski, Piotr [1,4,5]

Barra-Chicote, Roberto [1,5]

Korzekwa, Daniel [1,5]

Lorenzo-Trueba, Jaime [1,5]

Affiliations:

[1] Amazon TTS, Bangalore, Karnataka, India

[2] Chinese Univ Hong Kong, Dept Elect Engn, Hong Kong, Peoples R China

[3] Gdansk Univ Technol, Gdansk, Poland

[4] Univ Warsaw, Warsaw, Poland

[5] Amazon TTS Res, Bangalore, Karnataka, India

Source:

INTERSPEECH 2023 | 2023

Keywords:

text-to-speech; prosody modelling; acoustic model; normalizing flows; diffusion

DOI:

10.21437/Interspeech.2023-834

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Discipline classification codes:

070206; 082403

Abstract:

Neural text-to-speech systems are often optimized on L1/L2 losses, which make strong assumptions about the distributions of the target data space. Aiming to improve those assumptions, Normalizing Flows and Diffusion Probabilistic Models were recently proposed as alternatives. In this paper, we compare traditional L1/L2-based approaches to diffusion and flow-based approaches for the tasks of prosody and mel-spectrogram prediction for text-to-speech synthesis. We use a prosody model to generate log-f0 and duration features, which are used to condition an acoustic model that generates mel-spectrograms. Experimental results demonstrate that the flow-based model achieves the best performance for spectrogram prediction, improving over equivalent diffusion and L1 models. Meanwhile, both diffusion and flow-based prosody predictors result in significant improvements over a typical L2-trained prosody model.
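The abstract's remark that L1/L2 losses "make strong assumptions" can be illustrated with a small sketch. It is not taken from the paper: the function names and the toy log-f0 values are hypothetical. It relies only on the standard facts that minimizing an L2 loss is equivalent to maximizing a fixed-variance Gaussian likelihood, and minimizing an L1 loss is equivalent to maximizing a fixed-scale Laplace likelihood.

```python
import math

def l2_loss(pred, targets):
    # Mean squared error of a single point prediction.
    return sum((t - pred) ** 2 for t in targets) / len(targets)

def l1_loss(pred, targets):
    # Mean absolute error of a single point prediction.
    return sum(abs(t - pred) for t in targets) / len(targets)

def gaussian_nll(pred, targets, sigma=1.0):
    # Mean negative log-likelihood under N(pred, sigma^2).
    const = 0.5 * math.log(2 * math.pi * sigma ** 2)
    return sum(const + (t - pred) ** 2 / (2 * sigma ** 2) for t in targets) / len(targets)

def laplace_nll(pred, targets, b=1.0):
    # Mean negative log-likelihood under Laplace(pred, b).
    const = math.log(2 * b)
    return sum(const + abs(t - pred) / b for t in targets) / len(targets)

# Hypothetical log-f0 targets for the same text, forming two clusters
# (toy numbers for illustration only, not data from the paper).
targets = [4.6, 4.7, 5.3, 5.4, 5.5]

# Grid of candidate point predictions from 4.00 to 6.00.
candidates = [4.0 + 0.01 * i for i in range(201)]

best_l2 = min(candidates, key=lambda p: l2_loss(p, targets))
best_gauss = min(candidates, key=lambda p: gaussian_nll(p, targets))
best_l1 = min(candidates, key=lambda p: l1_loss(p, targets))
best_laplace = min(candidates, key=lambda p: laplace_nll(p, targets))

print("L2 / Gaussian optimum:", round(best_l2, 2), round(best_gauss, 2))
print("L1 / Laplace optimum: ", round(best_l1, 2), round(best_laplace, 2))
```

In this toy setting the L2 optimum is the sample mean, which falls between the two clusters where no target lies, and the L1 optimum is the sample median. Either way a single point estimate stands in for a two-cluster distribution, which is the kind of limitation that sampling-based models such as flows and diffusion are meant to address.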


Pages: 27-31

Page count: 5
