EFFECT OF DATA REDUCTION ON SEQUENCE-TO-SEQUENCE NEURAL TTS

Cited by: 0
Authors
Latorre, Javier [1]
Lachowicz, Jakub [1]
Lorenzo-Trueba, Jaime [1]
Merritt, Thomas [1]
Drugman, Thomas [1]
Ronanki, Srikanth [1]
Klimkov, Viacheslav [1]
Affiliations
[1] Amazon.com, Seattle, WA 98109, USA
Source
2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2019
Keywords
statistical parametric speech synthesis; autoregressive; neural vocoder; generative models; sequence-to-sequence;
DOI
Not available
CLC Number
O42 [Acoustics];
Discipline Codes
070206; 082403;
Abstract
Recent speech synthesis systems based on sampling from autoregressive neural network models can generate speech that is almost indistinguishable from human recordings. However, these models require large amounts of data. This paper shows that a lack of data from one speaker can be compensated for with data from other speakers. The naturalness of Tacotron2-like models trained on a blend of 5k utterances from 7 speakers is better than or equivalent to that of speaker-dependent models trained on 15k utterances. Additionally, multispeaker models are consistently more stable. We also demonstrate that models mixing only 1250 utterances from a target speaker with 5k utterances from another 6 speakers can produce significantly better quality than state-of-the-art DNN-guided unit-selection systems trained on more than 10 times as much data from the target speaker.
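The data blend described in the abstract (a reduced target-speaker set mixed with utterances from supporting speakers, each example tagged with a speaker ID for a speaker-embedding lookup) can be sketched as follows. This is a minimal illustration only; the function name, data layout, and per-speaker split are hypothetical and not taken from the paper.

```python
import random

def blend_corpus(target_utts, support_corpora, n_target=1250, n_support_total=5000):
    """Mix a reduced target-speaker corpus with utterances from supporting
    speakers; each example carries a speaker ID so a multispeaker
    Tacotron2-like model can condition on a speaker embedding.
    (Hypothetical helper, for illustration only.)"""
    rng = random.Random(0)  # fixed seed so the blend is reproducible
    # Subsample the target speaker down to n_target utterances.
    blended = [("target", u) for u in rng.sample(target_utts, n_target)]
    # Spread the supporting-data budget evenly across the other speakers.
    per_speaker = n_support_total // len(support_corpora)
    for spk, utts in support_corpora.items():
        blended += [(spk, u) for u in rng.sample(utts, per_speaker)]
    rng.shuffle(blended)  # interleave speakers for training
    return blended

# Toy usage with placeholder utterance IDs (not real data).
target = [f"tgt_{i}" for i in range(2000)]
support = {f"spk{j}": [f"spk{j}_{i}" for i in range(1000)] for j in range(6)}
data = blend_corpus(target, support)
print(len(data))  # 1250 target + 6 * (5000 // 6) supporting utterances
```

With 6 supporting speakers the integer split gives 833 utterances each, so the blended set holds 1250 + 4998 = 6248 examples.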
Pages: 7075-7079
Page count: 5