Expressive, Variable, and Controllable Duration Modelling in TTS

Cited by: 6
Authors
Abbas, Ammar [1 ]
Merritt, Thomas [1 ]
Moinet, Alexis [1 ]
Karlapati, Sri [1 ]
Muszynska, Ewa [1 ]
Slangen, Simon [1 ]
Gatti, Elia [1 ]
Drugman, Thomas [1 ]
Affiliations
[1] Amazon, Alexa AI, Mountain View, CA 94043 USA
Source
INTERSPEECH 2022 | 2022
Keywords
neural text-to-speech; normalising flows; expressive TTS; duration modelling
DOI
10.21437/Interspeech.2022-384
CLC classification
O42 [Acoustics];
Discipline codes
070206; 082403;
Abstract
Duration modelling has become an important research problem once more with the rise of non-attention neural text-to-speech systems. Current approaches largely fall back on duration prediction techniques from earlier statistical parametric speech synthesis, which poorly model the expressiveness and variability of speech. In this paper, we propose two alternative approaches to improve duration modelling. First, we propose a duration model conditioned on phrasing that improves the predicted durations and provides better modelling of pauses; we show that it improves the naturalness of speech over our baseline duration model. Second, we propose a multi-speaker duration model, called Cauliflow, which uses normalising flows to predict durations that better match the complex target duration distribution. Cauliflow performs on par with our other proposed duration model in terms of naturalness, whilst providing variable durations for the same prompt and variable levels of expressiveness. Lastly, we propose to condition Cauliflow on parameters that provide intuitive control of the pacing and pausing in the synthesised speech in a novel way.
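As a concrete illustration of the flow-based approach described in the abstract, the following is a minimal sketch of an affine-coupling normalising flow for duration modelling, written in PyTorch. The class and variable names, network sizes, and conditioning scheme are illustrative assumptions rather than the authors' Cauliflow implementation; the sketch only shows how exact log-likelihood training and variable, temperature-controlled sampling of durations can work with such a flow.

```python
# Hypothetical sketch of a flow-based duration model (not the authors' code).
# A single affine coupling step, conditioned on per-phoneme context such as
# text, phrasing, and speaker features; a full model would stack such steps.
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim: int, ctx_dim: int, hidden: int = 64):
        super().__init__()
        self.half = dim // 2
        # Small network predicting scale and shift for the second half,
        # given the first half and the conditioning context.
        self.net = nn.Sequential(
            nn.Linear(self.half + ctx_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, x, ctx):
        # Data -> latent direction, used for exact log-likelihood training.
        x1, x2 = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(torch.cat([x1, ctx], dim=-1)).chunk(2, dim=-1)
        z2 = (x2 - t) * torch.exp(-log_s)
        log_det = -log_s.sum(dim=-1)            # log|det Jacobian| of this step
        return torch.cat([x1, z2], dim=-1), log_det

    def inverse(self, z, ctx):
        # Latent -> data direction, used to sample durations at synthesis time.
        z1, z2 = z[:, :self.half], z[:, self.half:]
        log_s, t = self.net(torch.cat([z1, ctx], dim=-1)).chunk(2, dim=-1)
        x2 = z2 * torch.exp(log_s) + t
        return torch.cat([z1, x2], dim=-1)

# Sampling: draw a Gaussian latent and invert the flow. Scaling the latent
# (a "temperature") varies how expressive the sampled durations are, giving
# different durations for the same prompt on each draw.
flow = AffineCoupling(dim=8, ctx_dim=16)
ctx = torch.randn(4, 16)            # toy conditioning features (assumed shape)
z = 0.7 * torch.randn(4, 8)         # temperature-scaled latent
durations = flow.inverse(z, ctx)    # sampled (log-)durations, one row per item
```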
Pages: 4546-4550
Page count: 5
相关论文
共 29 条
[1]  
B. Series, 2014, METH SUBJ ASS INT QU
[2]   The Syntax-Prosody Interface [J].
Bennett, Ryan ;
Elfner, Emily .
ANNUAL REVIEW OF LINGUISTICS, VOL 5, 2019, 5 :151-171
[3]  
Devlin J, 2019, 2019 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL HLT 2019), VOL. 1, P4171
[4]  
Dinh L, 2017, PR MACH LEARN RES, V70
[5]  
Elias I., 2021, ARXIV210314574
[6]   UNIVERSAL NEURAL VOCODING WITH PARALLEL WAVENET [J].
Jiao, Yunlong ;
Gabrys, Adam ;
Tinchev, Georgi ;
Putrycz, Bartosz ;
Korzekwa, Daniel ;
Klimkov, Viacheslav .
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, :6044-6048
[7]  
Kalchbrenner N, 2018, PR MACH LEARN RES, V80
[8]   PROSODIC REPRESENTATION LEARNING AND CONTEXTUAL SAMPLING FOR NEURAL TEXT-TO-SPEECH [J].
Karlapati, Sri ;
Abbas, Ammar ;
Hodari, Zack ;
Moinet, Alexis ;
Joly, Arnaud ;
Karanasou, Penny ;
Drugman, Thomas .
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, :6573-6577
[9]  
Kim J., 2020, P NEURIPS, P8067
[10]   Propagation Characteristics of an Industrial Environment Channel at 4.1 GHz [J].
Kim, Junseok ;
Kim, Chung-Sup ;
Hong, Ju-Yeon ;
Lim, Jong-Soo ;
Chong, Young-Jun .
12TH INTERNATIONAL CONFERENCE ON ICT CONVERGENCE (ICTC 2021): BEYOND THE PANDEMIC ERA WITH ICT CONVERGENCE INNOVATION, 2021, :1530-1532