Trainable unit selection speech synthesis under statistical framework

Cited by: 4
Authors
Wang RenHua [1 ]
Dai LiRong [1 ]
Ling ZhenHua [1 ]
Hu Yu [1 ]
Affiliations
[1] Univ Sci & Technol China, iFLYTEK Speech Lab, Hefei 230027, Peoples R China
Source
CHINESE SCIENCE BULLETIN | 2009 / Vol. 54 / No. 11
Funding
National Natural Science Foundation of China;
Keywords
speech synthesis; unit selection and waveform concatenation; statistical modeling; maximum likelihood criterion;
DOI
10.1007/s11434-009-0267-3
Chinese Library Classification (CLC)
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Science];
Discipline codes
07; 0710; 09;
Abstract
This paper proposes a trainable unit selection speech synthesis method based on a statistical modeling framework. At the training stage, acoustic features are extracted from the training database and a statistical model is estimated for each feature. During synthesis, the optimal candidate unit sequence is searched from the database under a maximum likelihood criterion derived from the trained models. Finally, the waveforms of the optimal candidate units are concatenated to produce the synthetic speech. Experimental results show that, compared with conventional unit selection synthesis, this method effectively improves both the automation of system construction and the naturalness of the synthetic speech. Furthermore, this paper presents a minimum unit selection error criterion for model training, tailored to the characteristics of unit selection synthesis, and adopts discriminative training for model parameter estimation. This criterion achieves fully automatic system construction and further improves the naturalness of the synthetic speech.
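As a rough illustration of the synthesis-stage search described in the abstract, the sketch below runs a Viterbi-style dynamic program over candidate units, scoring each candidate with a Gaussian target log-likelihood plus a concatenation log-likelihood between adjacent units, and returning the maximum-likelihood sequence. All names (`select_units`, `gaussian_loglik`), the scalar features, and the univariate Gaussian models are illustrative assumptions for exposition, not the paper's actual models or features.

```python
import math

def gaussian_loglik(x, mean, var):
    # Log-likelihood of a scalar feature x under a univariate Gaussian.
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def select_units(candidates, target_models, concat_model):
    """Viterbi search over candidate units (hypothetical sketch).

    candidates[i]    -- list of scalar features for candidates at position i
    target_models[i] -- (mean, var) of the target Gaussian at position i
    concat_model     -- function(prev_feat, cur_feat) -> concatenation log-lik
    Returns the index of the chosen candidate at each position.
    """
    n = len(candidates)
    # best[i][j]: best total log-likelihood ending in candidate j at position i
    best = [[gaussian_loglik(c, *target_models[0]) for c in candidates[0]]]
    back = []  # back[i][j]: best predecessor of candidate j at position i+1
    for i in range(1, n):
        row, brow = [], []
        for c in candidates[i]:
            target = gaussian_loglik(c, *target_models[i])
            scores = [best[i - 1][k] + concat_model(candidates[i - 1][k], c)
                      for k in range(len(candidates[i - 1]))]
            k_best = max(range(len(scores)), key=scores.__getitem__)
            row.append(scores[k_best] + target)
            brow.append(k_best)
        best.append(row)
        back.append(brow)
    # Backtrack from the best final candidate.
    j = max(range(len(best[-1])), key=best[-1].__getitem__)
    path = [j]
    for i in range(n - 2, -1, -1):
        j = back[i][j]
        path.append(j)
    return list(reversed(path))
```

For example, with target means 1.0, 2.0, 3.0 and a concatenation model that penalizes large feature jumps (e.g. `lambda a, b: gaussian_loglik(b - a, 0.0, 0.5)`), the search picks the candidate closest to each target while keeping adjacent units smooth, which is the essence of the maximum likelihood criterion described above.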
Pages: 1963-1969
Number of pages: 7
References
15 records
[1] Anonymous, 1999, Proc. EUROSPEECH.
[2] Blum J R. Multidimensional stochastic approximation methods. Annals of Mathematical Statistics, 1954, 25(4): 737-744.
[3] Donovan R, 1996, Thesis, Cambridge University.
[4] Fukuda T, 1992, Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, p. 137.
[5] Hirai T, 2004, Proc. 5th ISCA Speech Synthesis Workshop, p. 37.
[6] Hunt A J, 1996, Proc. Int. Conf. Acoustics, Speech, and Signal Processing, p. 373. DOI: 10.1109/ICASSP.1996.541110.
[7] Juang B H, Chou W, Lee C H. Minimum classification error rate methods for speech recognition. IEEE Transactions on Speech and Audio Processing, 1997, 5(3): 257-265.
[8] Ling Z H, 2008, Proc. Int. Conf. Acoustics, Speech, and Signal Processing, p. 3949.
[9] Ling Z H, 2007, Proc. Int. Conf. Acoustics, Speech, and Signal Processing, p. 1245.
[10] Shinoda K, 2000, Journal of the Acoustical Society of Japan (E), 21: 79. DOI: 10.1250/ast.21.79.