Speech Synthesis Adaptation Method Based on Phoneme-Level Speaker Embedding Under Small Data

Cited by: 0
Authors
Xu Z.-H. [1 ,2 ]
Chen B. [1 ,2 ]
Zhang H. [3 ]
Yu K. [1 ,2 ]
Affiliations
[1] MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai
[2] Lab of Cross-Media Language Intelligence, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai
[3] AiSpeech Ltd, Suzhou
Source
Jisuanji Xuebao/Chinese Journal of Computers | 2022, Vol. 45, No. 05
Keywords
Duration model; Small data; Speaker adaptation; Speaker embedding; Text to speech
DOI
10.11897/SP.J.1016.2022.01003
Abstract
In speech synthesis, speaker adaptation with a small amount of user-recorded data has long faced a central problem: how to synthesize speech that is highly similar to the target speaker without excessively degrading the naturalness of the synthesized speech. Existing utterance-level and frame-level speaker embedding methods yield low similarity when synthesizing speech for unseen test speakers, and fine-tuning a pre-trained speech synthesis model on a small amount of user-recorded data can improve the similarity of the synthesized audio but is often accompanied by a drop in naturalness. To solve this problem, we propose a novel speech synthesis adaptation method based on phoneme-level speaker embedding. In the training stage, phoneme-level speaker embeddings are extracted from real acoustic-feature fragments to condition the training of the speech synthesis model. In the adaptation stage, we quickly adapt a speaker embedding predictor network, which replaces the real audio at inference time to produce the phoneme-level speaker embeddings. We conduct experiments on a small amount of real user-recorded data and compare common speaker embedding methods at different granularities. Experiments show that, compared with these speaker embedding methods, our method shows no significant decrease in naturalness and achieves the best similarity without updating the speech synthesis model; when the speech synthesis model is updated, our method achieves the best naturalness and similarity simultaneously. Further analysis shows that phoneme-level speaker embedding provides a better initialization for model adaptation without increasing the adaptation training time, effectively improving the quality of the speech synthesized by the adapted model. © 2022, Science Press. All rights reserved.
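To make the two-stage method in the abstract concrete, here is a minimal sketch of how phoneme-level speaker embeddings could be extracted from real feature fragments during training and regressed by a predictor network for audio-free inference. This is not the authors' implementation: all module names, dimensions, the segmental average pooling, and the GRU predictor are assumptions for illustration only.

```python
# Minimal sketch (not the authors' code) of phoneme-level speaker embedding.
# All names, sizes, and architectural choices here are assumptions.
import torch
import torch.nn as nn


class PhonemeLevelSpeakerEncoder(nn.Module):
    """Training stage: pool real acoustic-feature fragments into one
    speaker embedding per phoneme (assumed segmental average pooling)."""

    def __init__(self, n_mels: int = 80, embed_dim: int = 128):
        super().__init__()
        self.frame_net = nn.Sequential(
            nn.Linear(n_mels, 256), nn.ReLU(), nn.Linear(256, embed_dim)
        )

    def forward(self, mels: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
        # mels: (T, n_mels) ground-truth frames; durations: (P,) frames per phoneme.
        frames = self.frame_net(mels)                       # (T, embed_dim)
        segments = torch.split(frames, durations.tolist())  # P variable-length segments
        # Average the frames belonging to each phoneme -> (P, embed_dim).
        return torch.stack([seg.mean(dim=0) for seg in segments])


class PhonemeEmbeddingPredictor(nn.Module):
    """Adaptation stage: predict phoneme-level embeddings from text-side
    phoneme encodings, so no real audio is needed at inference time."""

    def __init__(self, phone_dim: int = 256, embed_dim: int = 128):
        super().__init__()
        self.rnn = nn.GRU(phone_dim, embed_dim, batch_first=True)

    def forward(self, phone_encodings: torch.Tensor) -> torch.Tensor:
        # phone_encodings: (1, P, phone_dim) from the TTS text encoder.
        out, _ = self.rnn(phone_encodings)
        return out.squeeze(0)  # (P, embed_dim)


if __name__ == "__main__":
    enc = PhonemeLevelSpeakerEncoder()
    pred = PhonemeEmbeddingPredictor()
    mels = torch.randn(50, 80)              # 50 frames of acoustic features
    durations = torch.tensor([10, 15, 25])  # 3 phonemes covering the 50 frames
    target = enc(mels, durations)           # (3, 128) embeddings from real audio
    phones = torch.randn(1, 3, 256)         # text-side phoneme encodings
    # Quick adaptation: regress the predictor's output onto the embeddings
    # extracted from the small set of real user recordings.
    loss = nn.functional.mse_loss(pred(phones), target)
    loss.backward()
```

Under these assumptions, the TTS model is conditioned on the encoder's embeddings during training, and only the lightweight predictor needs to be fine-tuned on the adaptation data, which is consistent with the abstract's claim that adaptation training time does not increase.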
Pages: 1003-1017 (14 pages)