DEEP BELIEF NETWORK-BASED POST-FILTERING FOR STATISTICAL PARAMETRIC SPEECH SYNTHESIS

Cited by: 0
Authors
Hu, Ya-Jun [1 ]
Ling, Zhen-Hua [1 ]
Dai, Li-Rong [1 ]
Affiliations
[1] Univ Sci & Technol China, Natl Engn Lab Speech & Language Informat Proc, Hefei, Peoples R China
Source
2016 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING PROCEEDINGS | 2016
Keywords
speech synthesis; hidden Markov model; post-filter; deep belief network; restricted Boltzmann machine
DOI
Not available
Chinese Library Classification (CLC)
O42 [Acoustics]
Discipline code
070206; 082403
Abstract
Speech synthesized by statistical parametric speech synthesis (SPSS) often sounds muffled. One important reason is that the generated spectral envelopes are over-smoothed, so many of the detailed spectral structures present in natural speech are lost. This paper presents a deep belief network (DBN)-based post-filtering method for hidden Markov model (HMM)-based SPSS to address this issue. At training time, a DBN is estimated on spectral envelopes extracted from natural speech. At synthesis time, this DBN serves as a generatively trained post-filter that processes the spectral envelopes recovered from the predicted spectral features. Experimental results show that the effectiveness of the method depends on the sampling strategy used to generate the training data for the restricted Boltzmann machines (RBMs) that form the higher layers of the DBN. When binary samples are used instead of the mean-field approximation, the DBN post-filter alleviates the over-smoothing effect of parameter generation and significantly improves the naturalness of synthetic speech, whether mel-cepstra or line spectral pairs (LSP) are used as spectral features. Its performance is comparable to that of parameter generation with global variance (GV) modeling for mel-cepstra and better than the LSP-based formant enhancement method used in previous work.
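The key contrast the abstract highlights is how training data for the higher-layer RBMs is produced during greedy layer-wise pre-training: binary samples of the hidden units versus their mean-field (sigmoid) probabilities. The NumPy sketch below only illustrates that choice, plus a simple up-down reconstruction pass used as a post-filter; it is not the authors' implementation. The class and function names, layer sizes, learning rate, and the use of Bernoulli-Bernoulli units throughout (a Gaussian-Bernoulli bottom layer would fit real-valued spectral envelopes better) are all assumptions made for brevity.

```python
# Minimal sketch (not the paper's implementation) of DBN pre-training on
# spectral envelopes, with a switch between binary hidden samples and
# mean-field probabilities when building the next layer's training data.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """Bernoulli-Bernoulli RBM trained with 1-step contrastive divergence."""
    def __init__(self, n_vis, n_hid, lr=0.05):
        self.W = 0.01 * rng.standard_normal((n_vis, n_hid))
        self.b_vis = np.zeros(n_vis)
        self.b_hid = np.zeros(n_hid)
        self.lr = lr

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b_hid)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.b_vis)

    def cd1_update(self, v0):
        # Positive phase, one Gibbs step, then gradient-style parameter update.
        h0 = self.hidden_probs(v0)
        h0_sample = (rng.random(h0.shape) < h0).astype(float)
        v1 = self.visible_probs(h0_sample)
        h1 = self.hidden_probs(v1)
        self.W += self.lr * (v0.T @ h0 - v1.T @ h1) / len(v0)
        self.b_vis += self.lr * (v0 - v1).mean(axis=0)
        self.b_hid += self.lr * (h0 - h1).mean(axis=0)

def train_dbn(frames, layer_sizes, use_binary_samples=True, epochs=10):
    """Greedy layer-wise pre-training. `use_binary_samples` selects the
    sampling strategy the abstract discusses: binary hidden samples vs.
    mean-field probabilities as training data for the next RBM."""
    rbms, data = [], frames
    for n_hid in layer_sizes:
        rbm = RBM(data.shape[1], n_hid)
        for _ in range(epochs):
            rbm.cd1_update(data)
        probs = rbm.hidden_probs(data)
        data = ((rng.random(probs.shape) < probs).astype(float)
                if use_binary_samples else probs)  # mean-field alternative
        rbms.append(rbm)
    return rbms

def postfilter(rbms, envelope_frames):
    """Deterministic up-down pass: encode the generated (over-smoothed)
    envelopes to the top layer and reconstruct them -- one simple way to
    realise a DBN post-filter; the paper's exact procedure may differ."""
    x = envelope_frames
    for rbm in rbms:
        x = rbm.hidden_probs(x)
    for rbm in reversed(rbms):
        x = rbm.visible_probs(x)
    return x

# Toy usage with random "spectral envelopes" scaled to [0, 1]; in practice
# the features would come from vocoder analysis of natural speech.
natural = rng.random((200, 64))
dbn = train_dbn(natural, layer_sizes=[128, 128], use_binary_samples=True)
smoothed = rng.random((10, 64))
enhanced = postfilter(dbn, smoothed)
print(enhanced.shape)
```

Setting use_binary_samples=False switches to the mean-field variant, which, according to the abstract, yields a less effective post-filter than training on binary samples.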
Pages: 5510-5514
Number of pages: 5