BOFFIN TTS: FEW-SHOT SPEAKER ADAPTATION BY BAYESIAN OPTIMIZATION

Cited by: 0
Authors
Moss, Henry B. [1]
Aggarwal, Vatsal [2]
Prateek, Nishant [2]
Gonzalez, Javier [2]
Barra-Chicote, Roberto [2]
Affiliations
[1] Univ Lancaster, STOR i Ctr Doctoral Training, Lancaster, England
[2] Amazon Res Cambridge, Cambridge, England
Source
2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING | 2020
Keywords
text-to-speech; speaker adaptation; Bayesian optimization; transfer learning;
DOI
10.1109/icassp40776.2020.9054301
Chinese Library Classification (CLC) number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
We present BOFFIN TTS (Bayesian Optimization For FIne-tuning Neural Text To Speech), a novel approach for few-shot speaker adaptation. Here, the task is to fine-tune a pre-trained TTS model to mimic a new speaker using a small corpus of target utterances. We demonstrate that there does not exist a one-size-fits-all adaptation strategy, with convincing synthesis requiring a corpus-specific configuration of the hyper-parameters that control fine-tuning. By using Bayesian optimization to efficiently optimize these hyper-parameter values for a target speaker, we are able to perform adaptation with an average 30% improvement in speaker similarity over standard techniques. Results indicate, across multiple corpora, that BOFFIN TTS can learn to synthesize new speakers using less than ten minutes of audio, achieving the same naturalness as produced for the speakers used to train the base model.
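The abstract does not say which hyper-parameters are tuned or which surrogate model and acquisition function are used, so the following is only a minimal sketch of generic Bayesian optimization, not the authors' implementation. It assumes a Gaussian-process surrogate with a squared-exponential kernel, an expected-improvement acquisition maximized over random candidates, and a hypothetical `toy_adaptation_loss` that stands in for "fine-tune the base model with these settings and score the result". The two coordinates are hypothetical hyper-parameters rescaled to [0, 1].

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)  # element-wise error function for arrays


def rbf_kernel(a, b, length_scale=0.2):
    """Squared-exponential kernel between the rows of a and the rows of b."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dist / length_scale ** 2)


def gp_posterior(x_obs, y_obs, x_new, noise=1e-5):
    """Posterior mean and std of a zero-mean, unit-variance GP at x_new."""
    k_obs = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_cross = rbf_kernel(x_obs, x_new)
    weights = np.linalg.solve(k_obs, k_cross)  # K^{-1} K_cross
    mean = weights.T @ y_obs
    var = 1.0 - np.sum(k_cross * weights, axis=0)
    return mean, np.sqrt(np.clip(var, 1e-12, None))


def expected_improvement(mean, std, best):
    """Expected improvement over the current best value (minimization)."""
    z = (best - mean) / std
    cdf = 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    return (best - mean) * cdf + std * pdf


def toy_adaptation_loss(x):
    """Hypothetical stand-in for the loss after one fine-tuning run."""
    target = np.array([0.3, 0.7])  # assumed optimum, for illustration only
    return float(np.sum((x - target) ** 2))


def bayes_opt(objective, dim, n_init=5, n_iter=20, n_candidates=2000, seed=0):
    """Minimize objective over the unit cube with GP-based Bayesian optimization."""
    rng = np.random.default_rng(seed)
    xs = rng.uniform(size=(n_init, dim))        # random initial design
    ys = np.array([objective(x) for x in xs])
    for _ in range(n_iter):
        scale = ys.std() or 1.0                 # guard against zero spread
        y_norm = (ys - ys.mean()) / scale       # standardize for the GP prior
        candidates = rng.uniform(size=(n_candidates, dim))
        mean, std = gp_posterior(xs, y_norm, candidates)
        ei = expected_improvement(mean, std, y_norm.min())
        x_next = candidates[np.argmax(ei)]      # most promising configuration
        xs = np.vstack([xs, x_next])
        ys = np.append(ys, objective(x_next))   # one more "fine-tuning run"
    best = int(np.argmin(ys))
    return xs[best], float(ys[best]), ys


best_x, best_y, history = bayes_opt(toy_adaptation_loss, dim=2)
print("best configuration:", best_x, "loss:", best_y)
```

In a real adaptation setting each call to the objective would be an expensive fine-tuning run on the target speaker's corpus, which is why a sample-efficient search of this kind is attractive compared with grid or random search.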
Pages: 7639-7643
Page count: 5