PROSODY GENERATION USING FRAME-BASED GAUSSIAN PROCESS REGRESSION AND CLASSIFICATION FOR STATISTICAL PARAMETRIC SPEECH SYNTHESIS

Cited by: 0
Authors
Koriyama, Tomoki [1]
Kobayashi, Takao [1]
Affiliations
[1] Tokyo Inst Technol, Interdisciplinary Grad Sch Sci & Engn, Tokyo, Japan
Source
2015 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP) | 2015
Keywords
Statistical parametric speech synthesis; prosody; Gaussian process regression; Gaussian process classification; kernel function; RECOGNITION; EXTRACTION;
DOI
Not available
Chinese Library Classification
O42 [Acoustics];
Discipline classification codes
070206; 082403;
Abstract
This paper proposes novel models of F0 contours and phone durations based on Gaussian process regression (GPR) and Gaussian process classification (GPC) for statistical parametric speech synthesis. Although previous studies have shown that frame-based GPR is effective for modeling spectral features, its application to prosodic features, i.e., F0 and phone duration, has not been sufficiently investigated because the kernel function was designed for phonetic information only. We therefore propose a kernel function that can be applied to multiple units such as syllables, moras, and accent phrases. The proposed kernel function is based on temporal acoustic events, such as the beginning of an accent phrase, and uses the relative position between the target frame and the event. Objective and subjective tests show that GPR/GPC-based F0 and duration modeling predicts acoustic features more accurately than HMM-based speech synthesis.
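The idea of a kernel driven by the relative position between a frame and a temporal event can be illustrated with a minimal sketch. This is not the kernel defined in the paper: the squared-exponential form, the choice of "most recent event" as the reference point, the hyperparameter values, and all function names (`relative_positions`, `rbf_kernel`, `gpr_predict`) are assumptions made for illustration, and the F0 values are synthetic toy data.

```python
import numpy as np

def relative_positions(frame_times, event_times):
    """Offset (seconds) of each frame from the most recent event at or before it.

    Assumption: event_times is sorted and stands for, e.g., accent-phrase onsets.
    """
    frame_times = np.asarray(frame_times, dtype=float)
    event_times = np.asarray(event_times, dtype=float)
    idx = np.searchsorted(event_times, frame_times, side="right") - 1
    idx = np.clip(idx, 0, len(event_times) - 1)
    return frame_times - event_times[idx]

def rbf_kernel(a, b, length_scale=0.1, variance=1.0):
    """Squared-exponential kernel on relative positions (an illustrative choice)."""
    diff = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (diff / length_scale) ** 2)

def gpr_predict(x_train, y_train, x_test, noise=1e-2):
    """Posterior mean of standard GP regression with a constant (sample) mean."""
    mean = y_train.mean()
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    K_star = rbf_kernel(x_test, x_train)
    L = np.linalg.cholesky(K)  # K is positive definite thanks to the noise term
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train - mean))
    return mean + K_star @ alpha

# Toy data: two hypothetical accent phrases starting at 0.0 s and 1.0 s,
# with a synthetic log-F0 value that declines linearly within each phrase.
events = np.array([0.0, 1.0])
train_frames = np.arange(0.0, 2.0, 0.05)
x_train = relative_positions(train_frames, events)
y_train = 5.0 - 0.3 * x_train

# Frames at 0.4 s and 1.4 s sit 0.4 s after their respective phrase onsets.
x_test = relative_positions(np.array([0.4, 1.4]), events)
y_pred = gpr_predict(x_train, y_train, x_test)
```

Because the kernel sees only the offset from the event, the two test frames receive the same prediction even though they lie in different phrases. In this sketch, that is how frames in comparable positions within a unit share information.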
Pages: 4929-4933
Page count: 5