Intonation Control for Neural Text-to-Speech Synthesis with Polynomial Models of F0

Cited by: 0
Authors
Corkey, Niamh [1]
O'Mahony, Johannah [1]
King, Simon [1]
Affiliations
[1] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh, Midlothian, Scotland
Source
INTERSPEECH 2023 | 2023
Keywords
text-to-speech; speech synthesis; intonation modelling; prosody control; prosody transfer;
DOI
Not available
Chinese Library Classification number
O42 [Acoustics];
Discipline classification codes
070206 ; 082403 ;
Abstract
We present a novel, user-friendly approach for controlling patterns of intonation (a fundamental aspect of prosody) within a neural TTS system. The approach represents each F0 contour concisely by the coefficients of its Legendre polynomial series expansion, and trains a model (based on FastPitch) that is conditioned on these coefficient sets. At inference time, the model either predicts a coefficient set itself or accepts a target coefficient set from a user (e.g., a human in the loop), so that changing just a few values audibly alters the intonation of the output speech. This is particularly effective for intonation transfer, where the coefficient targets are extracted from a ground-truth recording, making the synthesised utterance more closely reflect the intonation of the real speaker.
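The coefficient representation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the polynomial degree of 4, and the synthetic contour are assumptions made for illustration, and the sketch assumes the F0 contour has already been extracted and has no unvoiced gaps. It uses NumPy's `numpy.polynomial.legendre` module to fit a contour, then edits one coefficient to change the overall slope.

```python
import numpy as np
from numpy.polynomial import legendre


def fit_f0_coefficients(f0, degree=4):
    """Least-squares Legendre coefficients for an F0 contour (Hz).

    `degree` is an illustrative choice, not a value taken from the paper.
    """
    f0 = np.asarray(f0, dtype=float)
    # Legendre polynomials are orthogonal on [-1, 1], so map time onto it.
    t = np.linspace(-1.0, 1.0, len(f0))
    return legendre.legfit(t, f0, degree)


def reconstruct_f0(coeffs, num_frames):
    """Evaluate the Legendre series on `num_frames` evenly spaced frames."""
    t = np.linspace(-1.0, 1.0, num_frames)
    return legendre.legval(t, coeffs)


# Synthetic falling contour standing in for a real utterance (assumption).
t = np.linspace(-1.0, 1.0, 200)
f0 = 200.0 - 30.0 * t + 15.0 * t**2

coeffs = fit_f0_coefficients(f0, degree=4)

# Editing a single value changes the contour shape: coefficient 1 scales
# the linear term, so raising it turns the fall into a rise.
edited = coeffs.copy()
edited[1] += 40.0
new_f0 = reconstruct_f0(edited, num_frames=200)
```

In this sketch coefficient 0 acts as the mean level of the contour, coefficient 1 as its overall slope, and coefficient 2 as its curvature, which is one way to see how a few values can give a compact handle on intonation.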
Pages: 2014-2015
Number of pages: 2