GPR-based Thai speech synthesis using multi-level duration prediction

Cited by: 2

Authors:

Moungsri, Decha [1]

Koriyama, Tomoki [2]

Kobayashi, Takao [2]

Affiliations:

[1] Tokyo Inst Technol, Interdisciplinary Grad Sch Sci & Engn, Yokohama, Kanagawa 2268502, Japan

[2] Tokyo Inst Technol, Sch Engn, Yokohama, Kanagawa 2268502, Japan

Source:

SPEECH COMMUNICATION | 2018 / Vol. 99

Keywords:

Thai language; Speech synthesis; Gaussian process regression; Multi-level model; Prosody; Duration prediction; NEURAL-NETWORKS; REGRESSION;

DOI:

10.1016/j.specom.2018.03.005

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

This paper proposes a multi-level Gaussian process regression (GPR)-based method for duration prediction that incorporates phone- and syllable-level duration models. In this method, we first train the syllable-level model and predict syllable durations for a given input of context labels. We then use the predicted syllable duration as an additional context for the phone-level model to predict phone durations. To apply multi-level duration prediction to the GPR-based speech synthesis framework, we designed phone- and syllable-level context sets for Thai that include linguistic information and the relative positions of speech units. We also examined a multi-level deep neural network (DNN)-based duration-prediction method that uses the same approach as the proposed multi-level GPR-based one. We conducted objective and subjective evaluations using two hours of training data to compare the proposed method with single-level ones. The results indicate that the proposed multi-level duration-prediction method outperformed single-level ones in both the DNN- and GPR-based frameworks. They also indicate that the proposed multi-level GPR-based method can provide better performance than the multi-level HMM-based duration-prediction method.
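
The two-stage flow described in the abstract (syllable-level prediction first, then phone-level prediction with the predicted syllable duration appended as an extra context feature) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `TinyGPR`, `add_syllable_context`, the squared-exponential kernel, the fixed hyperparameters, the two-dimensional context vectors, and all toy durations are assumptions made up for the example. The abstract also does not say whether predicted or reference syllable durations are used when training the phone-level model; the sketch assumes predicted ones.

```python
import math

def rbf(a, b, length=1.0):
    """Squared-exponential kernel between two feature vectors (assumed kernel)."""
    sq = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-0.5 * sq / length ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

class TinyGPR:
    """GPR posterior mean with a constant prior mean and fixed hyperparameters."""
    def __init__(self, length=1.0, noise=1e-2):
        self.length, self.noise = length, noise

    def fit(self, X, y):
        self.X = [list(x) for x in X]
        self.mean = sum(y) / len(y)
        # Gram matrix with observation-noise variance added on the diagonal.
        K = [[rbf(a, b, self.length) + (self.noise if i == j else 0.0)
              for j, b in enumerate(self.X)] for i, a in enumerate(self.X)]
        self.alpha = solve(K, [v - self.mean for v in y])
        return self

    def predict(self, X):
        return [self.mean + sum(rbf(x, xi, self.length) * a
                                for xi, a in zip(self.X, self.alpha)) for x in X]

def add_syllable_context(phone_ctx, phone_to_syl, syl_dur):
    """Append each phone's parent-syllable duration to its context vector."""
    return [list(ctx) + [syl_dur[s]] for ctx, s in zip(phone_ctx, phone_to_syl)]

# Toy data, invented for illustration only (durations in seconds).
syl_ctx = [[0.0, 0.0], [0.5, 1.0], [1.0, 2.0]]   # [relative position, tone id]
syl_dur = [0.20, 0.25, 0.32]
phone_ctx = [[0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0],
             [0.0, 0.0], [0.5, 1.0], [1.0, 0.0]]  # [position in syllable, is_vowel]
phone_to_syl = [0, 0, 1, 1, 2, 2, 2]             # parent syllable of each phone
phone_dur = [0.07, 0.13, 0.09, 0.16, 0.08, 0.15, 0.09]

# Stage 1: train the syllable-level model and predict syllable durations.
syl_model = TinyGPR().fit(syl_ctx, syl_dur)
pred_syl = syl_model.predict(syl_ctx)

# Stage 2: train the phone-level model on contexts augmented with stage-1 output.
phone_model = TinyGPR().fit(
    add_syllable_context(phone_ctx, phone_to_syl, pred_syl), phone_dur)

# Synthesis time: predict an unseen syllable, then its two phones.
new_syl = syl_model.predict([[0.25, 1.0]])
new_phone = phone_model.predict(
    add_syllable_context([[0.0, 0.0], [1.0, 1.0]], [0, 0], new_syl))
print(new_syl, new_phone)
```

A single-level baseline in this sketch would simply fit `TinyGPR` on `phone_ctx` without the appended syllable duration, which is the comparison the abstract refers to.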

Citation

Pages: 114-123

Number of pages: 10

43 references in total
