INTERACTIVE MULTI-LEVEL PROSODY CONTROL FOR EXPRESSIVE SPEECH SYNTHESIS

Cited by: 4

Authors:

Cornille, Tobias [1]

Wang, Fengna [2]

Bekker, Jessa [1]

Affiliations:

[1] Katholieke Univ Leuven, Leuven, Belgium

[2] Acapela Grp, Mons, Belgium

Source:

2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2022

Keywords:

speech synthesis; text-to-speech; prosody; controllability; hierarchical prosody embedding

DOI:

10.1109/ICASSP43922.2022.9746654

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Discipline classification codes:

070206; 082403

Abstract:

Recent neural-based text-to-speech (TTS) models are able to produce highly natural speech. To synthesize expressive speech, the prosody of the speech has to be modeled, and predicted/controlled during synthesis. However, intuitive control over prosody remains elusive. Some techniques only allow control over the global style of the speech and do not allow fine-grained adjustments. Other techniques create fine-grained prosody embeddings, but these are difficult to manipulate to obtain a desired speaking style. We thus present ConEx, a novel model for expressive speech synthesis, which can produce speech in a certain speaking style, while also allowing local adjustments to the prosody of the generated speech. The model builds upon the non-autoregressive architecture of FastSpeech and includes a reference encoder to learn global prosody embeddings, and a vector quantized variational autoencoder to create fine-grained prosody embeddings. To realize prosody manipulation, a new interactive method is proposed. Experiments on two datasets show that the model enables multi-level prosody control.

Citation

Pages: 8312-8316

Page count: 5

