Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation

Cited by: 0
Authors
Min, Dongchan [1 ]
Lee, Dong Bok [1 ]
Yang, Eunho [1 ,2 ]
Hwang, Sung Ju [1 ,2 ]
Affiliations
[1] Korea Adv Inst Sci & Technol KAIST, Grad Sch AI, Seoul, South Korea
[2] AITRICS, Seoul, South Korea
Source
INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 139 | 2021 / Vol. 139
Funding
National Research Foundation, Singapore;
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
With rapid progress in neural text-to-speech (TTS) models, personalized speech generation is now in high demand for many applications. For practical applicability, a TTS model should generate high-quality speech from only a few short audio samples of the given speaker. However, existing methods either require fine-tuning the model or achieve low adaptation quality without fine-tuning. In this work, we propose StyleSpeech, a new TTS model which not only synthesizes high-quality speech but also effectively adapts to new speakers. Specifically, we propose Style-Adaptive Layer Normalization (SALN), which aligns the gain and bias of the text input according to the style extracted from a reference speech audio. With SALN, our model effectively synthesizes speech in the style of the target speaker even from a single speech audio. Furthermore, to enhance StyleSpeech's adaptation to speech from new speakers, we extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes and performing episodic training. The experimental results show that our models generate high-quality speech which accurately follows the speaker's voice given a single short (1-3 sec) speech audio, significantly outperforming baselines.
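The core mechanism the abstract describes, Style-Adaptive Layer Normalization, normalizes the hidden text features and then replaces the usual fixed affine parameters with a gain and bias predicted from a style vector. The sketch below is a minimal PyTorch illustration of that idea, not the authors' implementation; the module name, dimensions, and the single linear predictor `affine` are assumptions for illustration.

```python
import torch
import torch.nn as nn


class SALN(nn.Module):
    """Sketch of Style-Adaptive Layer Normalization.

    Normalizes hidden features without learned affine parameters,
    then scales and shifts them with a gain and bias predicted
    from a style vector extracted from reference speech.
    """

    def __init__(self, hidden_dim: int, style_dim: int):
        super().__init__()
        # No elementwise affine: the style vector supplies gain/bias instead.
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        # Hypothetical predictor mapping the style vector to (gain, bias).
        self.affine = nn.Linear(style_dim, 2 * hidden_dim)

    def forward(self, h: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, hidden_dim); style: (batch, style_dim)
        gain, bias = self.affine(style).chunk(2, dim=-1)
        # Broadcast the per-utterance gain/bias over the sequence dimension.
        return gain.unsqueeze(1) * self.norm(h) + bias.unsqueeze(1)
```

Because the affine parameters are conditioned on the reference audio rather than learned as constants, the same text encoder can be steered toward a new speaker's style from a single reference utterance.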
Pages: 12