PROSODIC REPRESENTATION LEARNING AND CONTEXTUAL SAMPLING FOR NEURAL TEXT-TO-SPEECH

Times cited: 10
Authors
Karlapati, Sri [1]
Abbas, Ammar [1]
Hodari, Zack [2]
Moinet, Alexis [1]
Joly, Arnaud [1]
Karanasou, Penny [1]
Drugman, Thomas [1]
Affiliations
[1] Amazon Res, Cambridge, England
[2] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh, Midlothian, Scotland
Source
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021) | 2021
Keywords
TTS; prosody modelling; contextual prosody
DOI
10.1109/ICASSP39728.2021.9413696
Chinese Library Classification number
O42 [Acoustics]
Discipline classification code
070206; 082403
Abstract
In this paper, we introduce Kathaka, a model trained with a novel two-stage training process for neural speech synthesis with contextually appropriate prosody. In Stage I, we learn a prosodic distribution at the sentence level from mel-spectrograms available during training. In Stage II, we propose a novel method to sample from this learnt prosodic distribution using the contextual information available in text. To do this, we use BERT on text, and graph-attention networks on parse trees extracted from text. We show a statistically significant relative improvement of 13.2% in naturalness over a strong baseline when compared to recordings. We also conduct an ablation study on variations of our sampling technique, and show a statistically significant improvement over the baseline in each case.
Pages: 6573-6577
Number of pages: 5