Transfer Learning based Progressive Neural Networks for Acoustic Modeling in Statistical Parametric Speech Synthesis

Cited by: 2
Authors
Fu, Ruibo [1 ,2 ]
Tao, Jianhua [1 ,2 ,3 ]
Zheng, Yibin [1 ,2 ]
Wen, Zhengqi [1 ]
Affiliations
[1] Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing, Peoples R China
[3] CAS Ctr Excellence Brain Sci & Intelligence Techn, Beijing, Peoples R China
Source
19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES | 2018
Funding
National Natural Science Foundation of China;
Keywords
speech synthesis; progressive neural networks; acoustic modeling; transfer learning;
DOI
10.21437/Interspeech.2018-1265
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The fundamental frequency and the spectral parameters of speech are correlated, so the mapping learned from linguistic features for one can be leveraged to help predict the other. Conventional methods treat all acoustic features as a single stream for acoustic modeling, and multi-task learning methods optimize several targets under one global cost function. To improve the accuracy of the acoustic model, our method applies progressive deep neural networks (PDNN) to acoustic modeling in statistical parametric speech synthesis (SPSS). Each type of acoustic feature is modeled in a separate sub-network with its own cost function, and knowledge is transferred through lateral connections. Each sub-network in the PDNN can be trained step by step to reach its own optimum. Experiments are conducted to compare the proposed PDNN-based SPSS system with standard DNN methods. The multi-task learning (MTL) method is also applied to both the PDNN and DNN structures as a contrast experiment for transfer learning. The computational complexity, prediction order, and number of hierarchies of the PDNN are investigated. Both objective and subjective experimental results demonstrate the effectiveness of the proposed technique.
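The abstract describes the progressive architecture at a high level: one sub-network per acoustic-feature stream, each trained with its own cost function, with earlier, frozen sub-networks feeding later ones through lateral connections. Below is a minimal illustrative sketch of that idea in PyTorch; the layer sizes, the two-column setup (spectrum first, then F0), and the Column class are assumptions made for illustration and are not taken from the paper.

```python
import torch
import torch.nn as nn

class Column(nn.Module):
    """One sub-network ("column") of a progressive network (illustrative only).

    When `lateral_dim` is set, each hidden layer except the first also receives
    the corresponding hidden activation of a previously trained, frozen column
    through a learned lateral adapter.
    """
    def __init__(self, in_dim, hidden_dim, out_dim, lateral_dim=0, n_layers=3):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Linear(in_dim if i == 0 else hidden_dim, hidden_dim)
             for i in range(n_layers)]
        )
        self.laterals = (
            nn.ModuleList([nn.Linear(lateral_dim, hidden_dim)
                           for _ in range(n_layers - 1)])
            if lateral_dim else None
        )
        self.out = nn.Linear(hidden_dim, out_dim)

    def forward(self, x, prev_hiddens=None):
        hiddens, h = [], x
        for i, layer in enumerate(self.layers):
            h = layer(h)
            if i > 0 and self.laterals is not None and prev_hiddens is not None:
                # Lateral knowledge transfer from the frozen column.
                h = h + self.laterals[i - 1](prev_hiddens[i - 1])
            h = torch.relu(h)
            hiddens.append(h)
        return self.out(h), hiddens

# Step 1: train the spectrum column on linguistic features with its own
# cost function (dimensions below are placeholders, not values from the paper).
ling_dim, hid, spec_dim, f0_dim = 425, 256, 60, 1
spec_col = Column(ling_dim, hid, spec_dim)
# ... optimise spec_col, e.g. with an MSE loss on spectral targets ...

# Step 2: freeze the spectrum column, then train an F0 column that reuses the
# frozen column's hidden activations through the lateral connections.
for p in spec_col.parameters():
    p.requires_grad_(False)
f0_col = Column(ling_dim, hid, f0_dim, lateral_dim=hid)

x = torch.randn(8, ling_dim)          # dummy batch of linguistic features
with torch.no_grad():
    _, spec_hiddens = spec_col(x)
f0_pred, _ = f0_col(x, prev_hiddens=spec_hiddens)
```

Training each column in turn against its own loss, with earlier columns frozen, mirrors the step-by-step optimisation described in the abstract; whether F0 or spectrum is modeled first is one of the prediction-order choices the paper investigates.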
Pages: 907-911
Page count: 5