Expressive, Variable, and Controllable Duration Modelling in TTS

Cited by: 6
Authors
Abbas, Ammar [1 ]
Merritt, Thomas [1 ]
Moinet, Alexis [1 ]
Karlapati, Sri [1 ]
Muszynska, Ewa [1 ]
Slangen, Simon [1 ]
Gatti, Elia [1 ]
Drugman, Thomas [1 ]
Affiliations
[1] Amazon, Alexa AI, Mountain View, CA 94043 USA
Source
INTERSPEECH 2022 | 2022
Keywords
neural text-to-speech; normalising flows; expressive TTS; duration modelling
DOI
10.21437/Interspeech.2022-384
CLC classification
O42 [Acoustics];
Discipline codes
070206; 082403;
Abstract
Duration modelling has become an important research problem once more with the rise of non-attention neural text-to-speech systems. Current approaches largely fall back on duration prediction techniques from earlier statistical parametric speech synthesis, which poorly model the expressiveness and variability of speech. In this paper, we propose two alternative approaches to improve duration modelling. First, we propose a duration model conditioned on phrasing that improves the predicted durations and provides better modelling of pauses; we show that it improves the naturalness of speech over our baseline duration model. Second, we propose a multi-speaker duration model, called Cauliflow, which uses normalising flows to predict durations that better match the complex target duration distribution. Cauliflow performs on par with our other proposed duration model in terms of naturalness, whilst providing variable durations for the same prompt and variable levels of expressiveness. Lastly, we propose to condition Cauliflow on parameters that provide intuitive control of the pacing and pausing in the synthesised speech in a novel way.
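As a concrete illustration of the flow-based approach described in the abstract, the following is a minimal sketch of an affine-coupling normalising flow for duration modelling, written in PyTorch. The class and variable names, network sizes, and conditioning scheme are illustrative assumptions rather than the authors' Cauliflow implementation; the sketch only shows how exact log-likelihood training and variable, temperature-controlled sampling of durations can work with such a flow.

```python
# Hypothetical sketch of a flow-based duration model (not the authors' code).
# A single affine coupling step, conditioned on per-phoneme context such as
# text, phrasing, and speaker features; a full model would stack such steps.
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim: int, ctx_dim: int, hidden: int = 64):
        super().__init__()
        self.half = dim // 2
        # Small network predicting scale and shift for the second half,
        # given the first half and the conditioning context.
        self.net = nn.Sequential(
            nn.Linear(self.half + ctx_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, x, ctx):
        # Data -> latent direction, used for exact log-likelihood training.
        x1, x2 = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(torch.cat([x1, ctx], dim=-1)).chunk(2, dim=-1)
        z2 = (x2 - t) * torch.exp(-log_s)
        log_det = -log_s.sum(dim=-1)            # log|det Jacobian| of this step
        return torch.cat([x1, z2], dim=-1), log_det

    def inverse(self, z, ctx):
        # Latent -> data direction, used to sample durations at synthesis time.
        z1, z2 = z[:, :self.half], z[:, self.half:]
        log_s, t = self.net(torch.cat([z1, ctx], dim=-1)).chunk(2, dim=-1)
        x2 = z2 * torch.exp(log_s) + t
        return torch.cat([z1, x2], dim=-1)

# Sampling: draw a Gaussian latent and invert the flow. Scaling the latent
# (a "temperature") varies how expressive the sampled durations are, giving
# different durations for the same prompt on each draw.
flow = AffineCoupling(dim=8, ctx_dim=16)
ctx = torch.randn(4, 16)            # toy conditioning features (assumed shape)
z = 0.7 * torch.randn(4, 8)         # temperature-scaled latent
durations = flow.inverse(z, ctx)    # sampled (log-)durations, one row per item
```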
Pages: 4546-4550
Page count: 5
相关论文
共 29 条
[1]  
B. Series, 2014, METH SUBJ ASS INT QU
[2]   The Syntax-Prosody Interface [J].
Bennett, Ryan ;
Elfner, Emily .
ANNUAL REVIEW OF LINGUISTICS, VOL 5, 2019, 5 :151-171
[3]  
Devlin J, 2019, 2019 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL HLT 2019), VOL. 1, P4171
[4]  
Dinh L, 2017, PR MACH LEARN RES, V70
[5]  
Elias I., 2021, ARXIV210314574
[6]   UNIVERSAL NEURAL VOCODING WITH PARALLEL WAVENET [J].
Jiao, Yunlong ;
Gabrys, Adam ;
Tinchev, Georgi ;
Putrycz, Bartosz ;
Korzekwa, Daniel ;
Klimkov, Viacheslav .
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, :6044-6048
[7]  
Kalchbrenner N, 2018, PR MACH LEARN RES, V80
[8]   PROSODIC REPRESENTATION LEARNING AND CONTEXTUAL SAMPLING FOR NEURAL TEXT-TO-SPEECH [J].
Karlapati, Sri ;
Abbas, Ammar ;
Hodari, Zack ;
Moinet, Alexis ;
Joly, Arnaud ;
Karanasou, Penny ;
Drugman, Thomas .
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, :6573-6577
[9]  
Kim J., 2020, P NEURIPS, P8067
[10]   Propagation Characteristics of an Industrial Environment Channel at 4.1 GHz [J].
Kim, Junseok ;
Kim, Chung-Sup ;
Hong, Ju-Yeon ;
Lim, Jong-Soo ;
Chong, Young-Jun .
12TH INTERNATIONAL CONFERENCE ON ICT CONVERGENCE (ICTC 2021): BEYOND THE PANDEMIC ERA WITH ICT CONVERGENCE INNOVATION, 2021, :1530-1532