An improved minimum generation error based model adaptation for HMM-based speech synthesis

Cited by: 0
Authors
Wu, Yi-Jian [1]
Qin, Long [2]
Tokuda, Keiichi [1]
Affiliations
[1] Nagoya Inst Technol, Nagoya, Aichi, Japan
[2] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA USA
Source
INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5 | 2009
Keywords
Speech synthesis; HMM; speaker adaptation; minimum generation error; linear regression
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
A minimum generation error (MGE) criterion has previously been proposed for model training in HMM-based speech synthesis. In this paper, we apply the MGE criterion to model adaptation for HMM-based speech synthesis and introduce an MGE linear regression (MGELR) based model adaptation algorithm, in which the regression matrices used to transform the source models are optimized to minimize the generation errors on the adaptation data. In addition, we incorporate two recent improvements to the MGE criterion into MGELR-based model adaptation: state alignment under the MGE criterion, and the use of log spectral distortion (LSD) instead of Euclidean distance as the spectral distortion measure. Experimental results show that adaptation performance improved after incorporating these two techniques, and formal listening tests showed that the quality and speaker similarity of speech synthesized after MGELR-based adaptation were significantly better than with the original MLLR-based adaptation.
Pages: 1727 / +
Page count: 2