A POLYNOMIAL SEGMENT MODEL BASED STATISTICAL PARAMETRIC SPEECH SYNTHESIS SYSTEM

Cited by: 0
Authors
Sun, Jingwei [1 ]
Ding, Feng [1 ]
Wu, Yahui [1 ]
Affiliations
[1] Nokia Research, Beijing, People's Republic of China
Source
2009 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS 1-8, PROCEEDINGS | 2009
Keywords
Hidden Markov Model; Polynomial Segment Model; statistical parametric speech synthesis; mean trajectory
DOI
Not available
Chinese Library Classification (CLC) code
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
In this paper, we present a statistical parametric speech synthesis system based on the polynomial segment model (PSM). As one of the segmental models for speech signals, the PSM explicitly describes the trajectory of the features within a speech segment and preserves the segment's internal dynamics. In this work, spectral and excitation parameters are modeled simultaneously by PSMs, while the duration of each segment is modeled by a single Gaussian distribution. A top-down K-means clustering technique is applied for model tying. The mean trajectories obtained from the PSMs are used directly to generate speech parameters according to the estimated segment duration. An English speech synthesizer back-end is implemented on the CMU Arctic corpus, and the performance of the new approach is compared with that of the classical HMM-based one. Experimental results show that PSM modeling achieves naturalness and intelligibility of the synthetic speech similar to HMM modeling. The system is in the early stage of its development.
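As a concrete illustration of the mean-trajectory generation described in the abstract, the following is a minimal Python sketch, assuming the standard polynomial segment model parameterization in which the mean trajectory is the product of a normalized-time design matrix and a matrix of per-dimension polynomial coefficients. The function names, the coefficient matrix B, and all numeric values are hypothetical placeholders for illustration, not taken from the paper.

import numpy as np

def design_matrix(num_frames, order):
    # Rows are [1, t, t**2, ..., t**order], with t the frame index
    # normalized to [0, 1] over the segment.
    t = np.linspace(0.0, 1.0, num_frames)
    return np.vander(t, N=order + 1, increasing=True)

def mean_trajectory(coeffs, num_frames):
    # coeffs: (order + 1, feat_dim) polynomial coefficients of one tied PSM.
    # Returns (num_frames, feat_dim): one feature vector per frame, i.e. the
    # mean trajectory stretched to the estimated segment duration.
    order = coeffs.shape[0] - 1
    return design_matrix(num_frames, order) @ coeffs

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    B = rng.standard_normal((3, 25))        # hypothetical 2nd-order PSM, 25-dim features
    dur_mean, dur_std = 12.0, 2.0           # single-Gaussian duration model (placeholder values)
    num_frames = max(1, int(round(rng.normal(dur_mean, dur_std))))
    traj = mean_trajectory(B, num_frames)
    print(traj.shape)                       # e.g. (12, 25)

The sketch only evaluates the mean trajectory for a given duration; estimating the polynomial coefficients and the Gaussian duration statistics from training data, as well as the top-down K-means model tying, are outside its scope.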
Pages: 4021-4024
Number of pages: 4