Improving Deep Neural Network Based Speech Synthesis through Contextual Feature Parametrization and Multi-Task Learning

Cited by: 0
Authors
Zhengqi Wen
Kehuang Li
Zhen Huang
Chin-Hui Lee
Jianhua Tao
Institutions
[1] National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
[2] School of Electrical and Computer Engineering, Georgia Institute of Technology
[3] CAS Center for Excellence in Brain Science and Intelligence Technology, Institute of Automation, Chinese Academy of Sciences
[4] School of Computer and Control Engineering, University of Chinese Academy of Sciences
Source
Journal of Signal Processing Systems | 2018 / Volume 90
Keywords
DNN-based speech synthesis; Vocoder; Speech parametrization; BLSTM; Phoneme embedded vector; Multi-task learning; Pitch-scaled spectrum;
DOI: Not available
Abstract
We propose three techniques to improve deep neural network (DNN) based speech synthesis. First, at the DNN input we use real-valued contextual feature vectors to represent phoneme identity, part-of-speech and pause information, instead of the conventional binary vectors. Second, at the DNN output layer, parameters for the pitch-scaled spectrum and aperiodicity measures are estimated for constructing the excitation signal used in our baseline synthesis vocoder. Third, a bidirectional recurrent neural network architecture with long short-term memory (BLSTM) units is adopted and trained with multi-task learning for DNN-based speech synthesis. Experimental results demonstrate that the quality of the synthesized speech is improved by adopting the new input vectors and output parameters. The proposed BLSTM architecture is also beneficial for learning the mapping from the input contextual features to the speech parameters and for further improving speech quality.
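As an illustration of how the third technique might be realized, the following is a minimal sketch of a BLSTM acoustic model with two output heads trained jointly under a multi-task objective. It assumes PyTorch; the class name BLSTMAcousticModel, the layer sizes, the feature dimensions and the loss weight alpha are hypothetical choices for illustration, not values taken from the paper.

```python
# Minimal sketch of a BLSTM multi-task acoustic model along the lines of the abstract.
# Framework (PyTorch), layer sizes, feature dimensions and the task weight are
# illustrative assumptions, not values reported in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BLSTMAcousticModel(nn.Module):
    """Maps real-valued contextual feature frames to two sets of speech
    parameters (pitch-scaled spectrum and aperiodicity measures), so both
    tasks share the bidirectional LSTM representation."""

    def __init__(self, in_dim=355, hidden=256, spec_dim=513, aper_dim=25):
        super().__init__()
        self.blstm = nn.LSTM(in_dim, hidden, num_layers=2,
                             bidirectional=True, batch_first=True)
        self.spec_head = nn.Linear(2 * hidden, spec_dim)   # pitch-scaled spectrum head
        self.aper_head = nn.Linear(2 * hidden, aper_dim)   # aperiodicity head

    def forward(self, x):                  # x: (batch, frames, in_dim)
        h, _ = self.blstm(x)               # h: (batch, frames, 2 * hidden)
        return self.spec_head(h), self.aper_head(h)

def multitask_loss(spec_pred, aper_pred, spec_tgt, aper_tgt, alpha=0.5):
    """Joint objective: weighted sum of per-task MSE losses (alpha is assumed)."""
    return F.mse_loss(spec_pred, spec_tgt) + alpha * F.mse_loss(aper_pred, aper_tgt)

# Example forward pass on a random batch of 100 contextual feature frames.
if __name__ == "__main__":
    model = BLSTMAcousticModel()
    feats = torch.randn(2, 100, 355)
    spec, aper = model(feats)
    print(spec.shape, aper.shape)          # (2, 100, 513) and (2, 100, 25)
```

In such a setup the shared BLSTM layers receive gradients from both output heads, which is the intended benefit of multi-task learning over training a separate network per parameter stream.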
Pages: 1025-1037
Page count: 12