INTERACTIVE MULTI-LEVEL PROSODY CONTROL FOR EXPRESSIVE SPEECH SYNTHESIS

Cited by: 4
Authors
Cornille, Tobias [1]
Wang, Fengna [2]
Bekker, Jessa [1]
Affiliations
[1] Katholieke Univ Leuven, Leuven, Belgium
[2] Acapela Grp, Mons, Belgium
Source
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2022
Keywords
speech synthesis; text-to-speech; prosody; controllability; hierarchical prosody embedding;
DOI
10.1109/ICASSP43922.2022.9746654
Chinese Library Classification
O42 [Acoustics];
Discipline classification codes
070206; 082403;
Abstract
Recent neural-based text-to-speech (TTS) models are able to produce highly natural speech. To synthesize expressive speech, the prosody of the speech has to be modeled, and predicted/controlled during synthesis. However, intuitive control over prosody remains elusive. Some techniques only allow control over the global style of the speech and do not allow fine-grained adjustments. Other techniques create fine-grained prosody embeddings, but these are difficult to manipulate to obtain a desired speaking style. We thus present ConEx, a novel model for expressive speech synthesis, which can produce speech in a certain speaking style, while also allowing local adjustments to the prosody of the generated speech. The model builds upon the non-autoregressive architecture of FastSpeech and includes a reference encoder to learn global prosody embeddings, and a vector quantized variational autoencoder to create fine-grained prosody embeddings. To realize prosody manipulation, a new interactive method is proposed. Experiments on two datasets show that the model enables multi-level prosody control.
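As a rough illustration of the fine-grained part of the abstract, the sketch below shows the generic vector-quantization step used by VQ-VAE-style models: each continuous latent is mapped to the index of its nearest codebook vector, and a single index can then be swapped to change one position locally. This is not the ConEx implementation. The codebook values, the latents, and the helper names `quantize` and `edit_code` are hypothetical, and the interactive method proposed in the paper is not described in enough detail here to reproduce.

```python
from typing import List, Sequence


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each latent vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))
        for z in latents
    ]


def edit_code(indices: Sequence[int], position: int, new_index: int) -> List[int]:
    """Return a copy of the code sequence with one position replaced.

    Hypothetical stand-in for a local prosody adjustment; the original
    sequence is left unchanged.
    """
    edited = list(indices)
    edited[position] = new_index
    return edited


# Toy 2-D codebook and per-segment latents (illustrative values only).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
latents = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8]]

codes = quantize(latents, codebook)   # nearest codebook index per latent
edited = edit_code(codes, 1, 2)       # change only the second position
```

Because the fine-grained representation is a sequence of discrete indices in this sketch, a local change amounts to replacing one index while the rest of the sequence stays fixed. A global style embedding from a reference encoder would be a separate input and is not modeled here.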
Pages: 8312-8316
Page count: 5