Formant-based Frequency Warping for Improving Speaker Adaptation in HMM TTS

Cited: 0
Authors
Zhuang, Xin [1 ,2 ]
Qian, Yao [1 ]
Soong, Frank [1 ]
Wu, Yijian [3 ]
Zhang, Bo [2 ]
Affiliations
[1] Microsoft Res Asia, Beijing, Peoples R China
[2] Nankai Univ, Coll Software, Tianjin, Peoples R China
[3] Microsoft China, Beijing, Peoples R China
Source
11th Annual Conference of the International Speech Communication Association (INTERSPEECH 2010), Vols 1-2 | 2010
Keywords
speech synthesis; speaker adaptation; frequency warping; SPEECH SYNTHESIS; VOICE CONVERSION;
DOI
None available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Vocal Tract Length Normalization (VTLN), usually implemented as a frequency warping procedure (e.g., a bilinear transformation), has been used successfully in speech recognition to adapt spectral characteristics to a target speaker. In this study we exploit the same concept of frequency warping but concentrate explicitly on mapping the first four formant frequencies of five long vowels between source and target speakers. A universal warping function is thus constructed to improve MLLR-based speaker adaptation performance in TTS. The function first warps the frequency scale of the source speaker's speech data toward that of the target speaker, and an HMM is trained on the warped features. Finally, MLLR-based speaker adaptation is applied to the trained HMM to synthesize the target speaker's speech. When tested on a database of 4,000 sentences from a source speaker and 100 sentences each from a male and a female target speaker, formant-based frequency warping proved very effective in reducing the objective log spectral distortion relative to a system without it. The improvement is also confirmed subjectively in AB preference and ABX speaker-similarity listening tests.
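The record does not reproduce the warping function itself. A minimal sketch of one plausible realization, a piecewise-linear warp anchored at paired source/target formant frequencies and applied to a magnitude spectrum, might look like the following (function names, the 16 kHz sampling rate, and the formant values in the docstrings are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def piecewise_linear_warp(freqs, src_formants, tgt_formants, fs=16000.0):
    """Map source-speaker frequencies (Hz) to target-speaker frequencies
    with a piecewise-linear function anchored at paired formants.

    src_formants / tgt_formants: averaged formant frequencies, e.g.
    F1-F4 pooled over long vowels, in increasing order. Extra anchors
    at 0 Hz and the Nyquist frequency keep the mapping well defined
    over the whole axis.
    """
    nyq = fs / 2.0
    src = np.concatenate(([0.0], np.asarray(src_formants, float), [nyq]))
    tgt = np.concatenate(([0.0], np.asarray(tgt_formants, float), [nyq]))
    return np.interp(freqs, src, tgt)

def warp_spectrum(spectrum, src_formants, tgt_formants, fs=16000.0):
    """Resample a magnitude spectrum onto the warped frequency axis.

    spectrum: magnitudes on a linear frequency grid over [0, fs/2].
    Each output bin is filled from the source spectrum at the
    inverse-warped frequency (target grid mapped back to source).
    """
    n = len(spectrum)
    grid = np.linspace(0.0, fs / 2.0, n)
    # Inverse warp: for each output bin, find the source frequency
    # that feeds it, then interpolate the source spectrum there.
    src_freqs = piecewise_linear_warp(grid, tgt_formants, src_formants, fs)
    return np.interp(src_freqs, grid, spectrum)
```

By construction the warp is exact at the formant anchors and linear in between; with identical source and target formants it reduces to the identity, so the warped HMM training stage degenerates to the baseline without warping.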
Pages: 817 / +
Page count: 2
Cited references
13 in total
[1] Anonymous. CMUCS97148.
[2] Kawahara H., Masuda-Katsuse I., de Cheveigné A. Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds. Speech Communication, 1999, 27(3-4): 187-207.
[3] Saheer L. Proc. ICASSP, 2010.
[4] Shuang Z.-W. Proc. TC-STAR Workshop, 2006.
[5] Soong F.K. Proc. ICASSP, 1984.
[6] Stylianou Y., Cappé O., Moulines E. Continuous probabilistic transform for voice conversion. IEEE Transactions on Speech and Audio Processing, 1998, 6(2): 131-142.
[7] Tamura M. Proc. ICASSP, 2001: 805. DOI: 10.1109/ICASSP.2001.941037.
[8] Toda T. Proc. National Conference on Man-Machine Speech Communication, 2009: 492.
[9] Toda T., Black A.W., Tokuda K. Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory. IEEE Transactions on Audio, Speech, and Language Processing, 2007, 15(8): 2222-2235.
[10] Yamagishi J., Kobayashi T. Average-voice-based speech synthesis using HSMM-based speaker adaptation and adaptive training. IEICE Transactions on Information and Systems, 2007, E90-D(2): 533-543.