A New GAN-based End-to-End TTS Training Algorithm

Cited by: 10
Authors
Guo, Haohan [1 ]
Soong, Frank K. [2 ]
He, Lei [2 ]
Xie, Lei [1 ]
Affiliations
[1] Northwestern Polytech Univ, Sch Comp Sci, Xian, Peoples R China
[2] Microsoft China, Beijing, Peoples R China
Source
INTERSPEECH 2019 | 2019
Keywords
speech synthesis; end-to-end TTS synthesis; auto-regressive model; generative adversarial model; adversarial training;
DOI
10.21437/Interspeech.2019-2176
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject Classification Codes
100104; 100213
Abstract
End-to-end, autoregressive TTS models have shown significant performance improvements over conventional systems. However, training of the autoregressive module suffers from exposure bias, i.e., the mismatch between the distributions of real and predicted data: real data is fed back during training, while only predicted data is available at test time. By exposing the model to both real and generated data sequences during training, we can alleviate the effects of exposure bias. We propose to train with a Generative Adversarial Network (GAN), following the idea of "Professor Forcing": a discriminator is jointly trained to close the gap between the model's behavior on real and predicted data. In an AB subjective listening test, the new approach is preferred over standard transfer learning, with a CMOS improvement of 0.1. Sentence-level intelligibility tests also show significant improvement on a pathological test set, and the GAN-trained model produces more stable alignments for the Tacotron output than the baseline.
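For illustration only, the following is a minimal PyTorch-style sketch of Professor-Forcing-style adversarial training for an autoregressive decoder, in the spirit of the approach described above. All module names, dimensions, and loss weights are assumptions made for this sketch and are not taken from the paper's actual Tacotron implementation.

# Illustrative sketch (assumed details, not the paper's implementation):
# a toy autoregressive decoder is run in teacher-forcing and free-running modes,
# and a sequence discriminator is trained to tell the two regimes apart.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ARDecoder(nn.Module):
    # Toy autoregressive decoder: predicts the next acoustic frame from the previous one.
    def __init__(self, dim=80, hidden=256):
        super().__init__()
        self.rnn = nn.GRUCell(dim, hidden)
        self.proj = nn.Linear(hidden, dim)

    def forward(self, targets, teacher_forcing=True):
        B, T, D = targets.shape
        h = targets.new_zeros(B, self.rnn.hidden_size)
        prev = targets.new_zeros(B, D)
        frames, hiddens = [], []
        for t in range(T):
            h = self.rnn(prev, h)
            frame = self.proj(h)
            frames.append(frame)
            hiddens.append(h)
            # Teacher forcing feeds back the real frame; free running feeds back the prediction.
            prev = targets[:, t] if teacher_forcing else frame
        return torch.stack(frames, 1), torch.stack(hiddens, 1)

class SeqDiscriminator(nn.Module):
    # Classifies a hidden-state sequence as teacher-forced (real) or free-running (generated).
    def __init__(self, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(hidden, 128, batch_first=True)
        self.out = nn.Linear(128, 1)

    def forward(self, hiddens):
        _, h_last = self.rnn(hiddens)
        return self.out(h_last[-1])  # one logit per sequence

G, D = ARDecoder(), SeqDiscriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()
mels = torch.randn(4, 50, 80)  # dummy batch of target mel-spectrogram frames

# Discriminator step: learn to distinguish the two behaviour regimes.
with torch.no_grad():
    _, h_tf = G(mels, teacher_forcing=True)
    _, h_fr = G(mels, teacher_forcing=False)
d_loss = bce(D(h_tf), torch.ones(4, 1)) + bce(D(h_fr), torch.zeros(4, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: reconstruction losses plus an adversarial term that pushes
# the free-running dynamics to look like the teacher-forced ones.
out_tf, _ = G(mels, teacher_forcing=True)
out_fr, h_fr = G(mels, teacher_forcing=False)
g_loss = F.l1_loss(out_tf, mels) + F.l1_loss(out_fr, mels) \
         + bce(D(h_fr), torch.ones(4, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()

In practice, the discriminator would operate on decoder hidden states or acoustic features from the full Tacotron model rather than on this toy GRU, and the adversarial term would typically be weighted against the reconstruction losses.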
Pages: 1288-1292
Page count: 5