GPR-based Thai speech synthesis using multi-level duration prediction

Cited by: 2
Authors
Moungsri, Decha [1]
Koriyama, Tomoki [2]
Kobayashi, Takao [2]
Affiliations
[1] Tokyo Inst Technol, Interdisciplinary Grad Sch Sci & Engn, Yokohama, Kanagawa 2268502, Japan
[2] Tokyo Inst Technol, Sch Engn, Yokohama, Kanagawa 2268502, Japan
Keywords
Thai language; Speech synthesis; Gaussian process regression; Multi-level model; Prosody; Duration prediction; NEURAL-NETWORKS; REGRESSION;
DOI
10.1016/j.specom.2018.03.005
Chinese Library Classification number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
This paper proposes a multi-level Gaussian process regression (GPR)-based method for duration prediction that incorporates phone- and syllable-level duration models. In this method, we first train the syllable-level model and predict syllable durations for a given input of context labels. Then, we use the predicted syllable duration as an additional context for the phone-level model to predict phone durations. To apply multi-level duration prediction to the GPR-based speech synthesis framework, we designed phone- and syllable-level context sets for Thai that include linguistic information and the relative positions of speech units. We also examined a multi-level deep neural network (DNN)-based duration-prediction method, using the same approach as for the proposed multi-level GPR-based one. We conducted objective and subjective evaluations using two hours of training data to compare the proposed method with single-level ones. The results indicate that the proposed multi-level duration-prediction method outperformed single-level ones in both the DNN- and GPR-based frameworks. They also indicate that the proposed multi-level GPR-based method can provide better performance than the multi-level HMM-based duration-prediction method.
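The two-stage idea in the abstract (predict syllable durations first, then pass each prediction to the phone-level model as an extra context feature) can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the `TinyGPR` class, its RBF kernel, the fixed hyperparameters, the numeric context features, and all duration values are assumptions made up for the example. The abstract also does not say whether natural or predicted syllable durations are used when training the phone-level model; the sketch uses predicted ones throughout.

```python
import math

def rbf(a, b, length=1.0):
    """Squared-exponential kernel between two feature vectors."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-0.5 * d2 / length ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

class TinyGPR:
    """Hypothetical minimal GPR (posterior mean only, fixed hyperparameters)."""
    def __init__(self, length=1.0, noise=1e-2):
        self.length, self.noise = length, noise

    def fit(self, X, y):
        self.X = X
        self.mean = sum(y) / len(y)
        # Kernel matrix with observation noise added on the diagonal.
        K = [[rbf(a, b, self.length) + (self.noise if i == j else 0.0)
              for j, b in enumerate(X)] for i, a in enumerate(X)]
        self.alpha = solve(K, [v - self.mean for v in y])
        return self

    def predict(self, Xs):
        return [self.mean + sum(rbf(x, xi, self.length) * a
                                for xi, a in zip(self.X, self.alpha))
                for x in Xs]

# Made-up toy data: syllable context = [relative position, tone id].
syl_ctx = [[0.0, 1.0], [0.5, 2.0], [1.0, 1.0], [0.25, 3.0]]
syl_dur = [0.30, 0.22, 0.35, 0.26]            # seconds (invented)
# Phone context = [position in syllable]; phone_syl maps phone -> syllable.
phone_ctx = [[0.0], [1.0], [0.0], [1.0], [0.0], [1.0], [0.0], [1.0]]
phone_syl = [0, 0, 1, 1, 2, 2, 3, 3]
phone_dur = [0.12, 0.18, 0.09, 0.13, 0.15, 0.20, 0.10, 0.16]

# Stage 1: syllable-level model predicts syllable durations.
syl_model = TinyGPR().fit(syl_ctx, syl_dur)
pred_syl = syl_model.predict(syl_ctx)

# Stage 2: append the predicted syllable duration to each phone context.
phone_in = [ctx + [pred_syl[s]] for ctx, s in zip(phone_ctx, phone_syl)]
phone_model = TinyGPR().fit(phone_in, phone_dur)
pred_phone = phone_model.predict(phone_in)
```

At synthesis time the same chain would be applied to unseen context labels: the syllable model is queried first, and its output is appended to the phone-level input before the phone model is queried.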
Pages: 114-123
Number of pages: 10