Comparing normalizing flows and diffusion models for prosody and acoustic modelling in text-to-speech

Cited by: 0
Authors
Zhang, Guangyan [1, 2, 5]
Merritt, Thomas [1, 5]
Ribeiro, Manuel Sam [1, 5]
Tura-Vecino, Biel [1, 5]
Yanagisawa, Kayoko [1, 5]
Pokora, Kamil [1, 5]
Ezzerg, Abdelhamid [1, 5]
Cygert, Sebastian [1, 3, 5]
Abbas, Ammar [1, 5]
Bilinski, Piotr [1, 4, 5]
Barra-Chicote, Roberto [1, 5]
Korzekwa, Daniel [1, 5]
Lorenzo-Trueba, Jaime [1, 5]
Affiliations
[1] Amazon TTS, Bangalore, Karnataka, India
[2] Chinese Univ Hong Kong, Dept Elect Engn, Hong Kong, Peoples R China
[3] Gdansk Univ Technol, Gdansk, Poland
[4] Univ Warsaw, Warsaw, Poland
[5] Amazon TTS Res, Bangalore, Karnataka, India
Source
INTERSPEECH 2023 | 2023
Keywords
text-to-speech; prosody modelling; acoustic model; normalizing flows; diffusion
DOI
10.21437/Interspeech.2023-834
Chinese Library Classification (CLC) number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
Neural text-to-speech systems are often optimized on L1/L2 losses, which make strong assumptions about the distributions of the target data space. Aiming to improve those assumptions, Normalizing Flows and Diffusion Probabilistic Models were recently proposed as alternatives. In this paper, we compare traditional L1/L2-based approaches to diffusion and flow-based approaches for the tasks of prosody and mel-spectrogram prediction for text-to-speech synthesis. We use a prosody model to generate log-f0 and duration features, which are used to condition an acoustic model that generates mel-spectrograms. Experimental results demonstrate that the flow-based model achieves the best performance for spectrogram prediction, improving over equivalent diffusion and L1 models. Meanwhile, both diffusion and flow-based prosody predictors result in significant improvements over a typical L2-trained prosody model.
Pages: 27-31
Number of pages: 5