Fine-grained prosody modeling in neural speech synthesis using ToBI representation

Cited by: 8
Authors
Zou, Yuxiang [1]
Liu, Shichao [1]
Yin, Xiang [1]
Lin, Haopeng [1]
Wang, Chunfeng [1]
Zhang, Haoyu [1]
Ma, Zejun [1]
Affiliations
[1] Bytedance AI Lab, Beijing, People's Republic of China
Source
INTERSPEECH 2021 | 2021
Keywords
speech synthesis; ToBI representation; prosody modeling; intonation; stress and pause control;
DOI
10.21437/Interspeech.2021-883
Chinese Library Classification (CLC) number
R36 [Pathology]; R76 [Otorhinolaryngology];
Discipline classification code
100104; 100213;
Abstract
Benefiting from the great development of deep learning, modern neural text-to-speech (TTS) models can generate speech indistinguishable from natural speech. However, the generated utterances often keep an average prosodic style of the database instead of having rich prosodic variation. For pitch-stressed languages, such as English, accurate intonation and stress are important for conveying semantic information. In this work, we propose a fine-grained prosody modeling method in neural speech synthesis with ToBI (Tones and Break Indices) representation. The proposed system consists of a text frontend for ToBI prediction and a Tacotron-based TTS module for prosody modeling. By introducing the ToBI representation, we can control the system to synthesize speech with accurate intonation and stress at the syllable level. Compared with the two baselines (Tacotron and an unsupervised method), experiments show that our model can generate more natural speech with more accurate prosody, as well as effectively control the stress, intonation, and pauses of the speech.
Pages: 3146-3150
Page count: 5