IMPROVED MODELING FOR F0 GENERATION AND V/U DECISION IN HMM-BASED TTS

Cited by: 15

Authors:

Zhang, Qingqing [1,2]

Soong, Frank [1]

Qian, Yao [1]

Yan, Zhijie [1]

Pan, Jielin [2]

Yan, Yonghong [2]

Affiliations:

[1] Microsoft Res Asia, Beijing, Peoples R China

[2] Chinese Acad Sci, Inst Acoust, ThinkIT Speech Lab, Beijing 100864, Peoples R China

Source:

2010 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING | 2010

Keywords:

V/U decision model; F0; generation; voicing strength; HMM-based TTS;

DOI:

10.1109/ICASSP.2010.5495561

Chinese Library Classification:

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

HMM-based TTS can produce highly intelligible speech of decent quality. However, the synthesized speech sometimes exhibits perceptibly annoying glitches caused by F0 extraction errors in the training data and by voiced/unvoiced (v/u) swapping errors in F0 generation. In conventional MSD-based F0 modeling [10], the two incompatible probability spaces, a continuous probability density for voiced observations and a discrete probability for unvoiced observations, prevent us from using likelihood-based frame occupancy to alleviate the deteriorating effect of F0 extraction errors when training a more robust model for synthesis. In this paper, we propose a new approach that improves the modeling of the piecewise-continuous F0 trajectory and the v/u decision for HMM-based TTS. Voicing strength, characterized by the magnitude of the normalized correlation coefficient calculated during F0 feature extraction, is used as an additional feature in F0 modeling and for the v/u decision. Experimental results show that the new approach to F0 modeling and generation significantly outperforms the MSD-HMM method and a recently proposed GTD-HMM method [9]. The improvements are both objectively measurable and subjectively perceivable.
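As a rough illustration of the voicing-strength feature mentioned in the abstract, the sketch below computes the peak magnitude of a normalized cross-correlation over a range of candidate pitch lags and applies a fixed threshold to reach a v/u decision. This is not the authors' implementation: the function names, the lag range, and the threshold of 0.5 are all assumptions made for illustration, and the paper models voicing strength inside the HMM rather than thresholding a single frame as done here.

```python
import math

def voicing_strength(frame, min_lag, max_lag):
    """Peak magnitude of the normalized cross-correlation between a frame
    and its lagged copy, searched over lags min_lag..max_lag (in samples).
    Hypothetical helper; requires 1 <= min_lag <= max_lag < len(frame)."""
    best = 0.0
    for lag in range(min_lag, max_lag + 1):
        a = frame[:-lag]   # frame without its last `lag` samples
        b = frame[lag:]    # frame shifted forward by `lag` samples
        num = sum(x * y for x, y in zip(a, b))
        den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
        ncc = abs(num) / den if den > 0.0 else 0.0
        best = max(best, ncc)
    return best

def is_voiced(strength, threshold=0.5):
    """Assumed fixed-threshold v/u rule; the threshold value is illustrative."""
    return strength >= threshold

# Usage: a 100 Hz sinusoid sampled at 8 kHz (period of 80 samples),
# searched over lags corresponding to roughly 60-400 Hz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 100 * n / sample_rate) for n in range(400)]
silence = [0.0] * 400
tone_strength = voicing_strength(tone, 20, 133)
silence_strength = voicing_strength(silence, 20, 133)
```

A strongly periodic frame yields a strength close to 1, while a silent frame yields 0, so the two fall on opposite sides of the assumed threshold.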

Citation

Pages: 4606-4609

Page count: 4

15 references in total

[1] [Anonymous], APREPITANT INTERVIEW

[2] Arifianto D, 2004, IEICE T INF SYST, VE87D, P2812

[3] Chen C.J., 1997, P EUR 1997, P1543

[4] Kang S., 2009, P INTSPEECH2009 BRIG

[5] Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds [J].

Kawahara, H; Masuda-Katsuse, I; de Cheveigné, A.

SPEECH COMMUNICATION, 1999, 27 (3-4): 187-207

[6] Kawahara H., 1999, P EUROSPEECH

[7] Masuko T., 2000, Transactions of the Institute of Electronics, Information and Communication Engineers D-II, VJ83D-II, P1600

[8] Qian Y., 2009, P INTERSPEECH 2009

[9] Shinoda K., 2000, Journal of the Acoustical Society of Japan (E), V21, P79, DOI 10.1250/ast.21.79

[10] Talkin D., 1995, Speech coding and synthesis, V495, P518
