Modeling Spectral Envelopes Using Restricted Boltzmann Machines and Deep Belief Networks for Statistical Parametric Speech Synthesis

Cited by: 106
Authors
Ling, Zhen-Hua [1 ]
Deng, Li [2 ]
Yu, Dong [2 ]
Affiliations
[1] Univ Sci & Technol China, Natl Engn Lab Speech & Language Informat Proc, Hefei 230027, Peoples R China
[2] Microsoft Res, Redmond, WA 98052 USA
Source
IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2013, Vol. 21, No. 10
Keywords
Deep belief network; hidden Markov model; restricted Boltzmann machine; spectral envelope; speech synthesis; HMM; ALGORITHM; SYSTEM;
DOI
10.1109/TASL.2013.2269291
CLC Number
O42 [Acoustics];
Subject Classification Codes
070206 ; 082403 ;
Abstract
This paper presents a new spectral modeling method for statistical parametric speech synthesis. Conventional methods adopt high-level spectral parameters, such as mel-cepstra or line spectral pairs, as the features for hidden Markov model (HMM)-based parametric speech synthesis. The proposed method improves on the conventional approach in two ways. First, distributions of low-level, untransformed spectral envelopes (extracted by the STRAIGHT vocoder) are used as the parameters for synthesis. Second, instead of a single Gaussian distribution, graphical models with multiple hidden variables, including restricted Boltzmann machines (RBMs) and deep belief networks (DBNs), represent the distribution of the low-level spectral envelopes at each HMM state. At synthesis time, the spectral envelopes are predicted from the RBM-HMMs or DBN-HMMs of the input sentence following the maximum output probability parameter generation criterion under the constraints of the dynamic features. A Gaussian approximation is applied to the marginal distribution of the visible stochastic variables in the RBM or DBN at each HMM state in order to achieve a closed-form solution to the parameter generation problem. Experimental results show that both RBM-HMM and DBN-HMM generate spectral envelope parameter sequences better than the conventional Gaussian-HMM, with superior generalization capability, and that DBN-HMM and RBM-HMM perform similarly, possibly because of the Gaussian approximation. As a result, the proposed method significantly alleviates the over-smoothing effect and improves the naturalness of the conventional mel-cepstrum-based HMM speech synthesis system.
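The abstract's key computational idea is replacing each HMM state's intractable RBM/DBN marginal over the visible (spectral-envelope) variables with a single Gaussian, so that standard closed-form parameter generation with dynamic features still applies. The sketch below illustrates one simple way such a Gaussian could be obtained from a Gaussian-Bernoulli RBM, namely moment-matching against Gibbs samples; the RBM parameters here are random placeholders, and the paper's actual approximation may be derived differently.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Gaussian-Bernoulli RBM for one HMM state (random weights
# for illustration only; a real system would train these per state).
D, H = 8, 16                          # visible (envelope) dim, hidden dim
W = 0.1 * rng.standard_normal((D, H))  # visible-hidden weights
b = np.zeros(D)                        # visible biases
c = np.zeros(H)                        # hidden biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sample_visible(n_samples=2000, n_steps=50):
    """Draw visible samples from the RBM by block Gibbs sampling:
    Gaussian visible units with unit variance, Bernoulli hidden units."""
    v = rng.standard_normal((n_samples, D))
    for _ in range(n_steps):
        p_h = sigmoid(v @ W + c)                            # p(h=1 | v)
        h = (rng.random(p_h.shape) < p_h).astype(float)
        v = b + h @ W.T + rng.standard_normal((n_samples, D))  # v | h ~ N(b + Wh, I)
    return v

# Moment-match a single Gaussian N(mu, sigma) to the sampled marginal p(v);
# (mu, sigma) would then stand in for the state's Gaussian in the usual
# dynamic-feature parameter-generation recursion.
samples = gibbs_sample_visible()
mu = samples.mean(axis=0)
sigma = np.cov(samples, rowvar=False)
```

Once every state has a (mu, sigma) pair, the maximum-output-probability trajectory with delta constraints reduces to the same linear system as in conventional Gaussian-HMM synthesis, which is what makes the closed-form solution possible.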
Pages: 2129-2139
Page count: 11