Adaptive Refinements of Pitch Tracking and HNR Estimation within a Vocoder for Statistical Parametric Speech Synthesis

Cited by: 0
Authors
Al-Radhi, Mohammed Salah [1 ]
Csapo, Tamas Gabor [1 ,2 ]
Nemeth, Geza [1 ]
Affiliations
[1] Budapest Univ Technol & Econ, Dept Telecommun & Media Informat, H-1117 Budapest, Hungary
[2] Hungarian Acad Sci, MTA ELTE Lendulet Lingual Articulat Res Grp, H-1088 Budapest, Hungary
Source
APPLIED SCIENCES-BASEL | 2019 / Vol. 9 / Issue 12
Keywords
continuous F0; speech synthesis; Kalman filter; time-warping; HNR; PLUS NOISE MODEL; INTELLIGIBILITY; EXCITATION; HMM
DOI
10.3390/app9122460
Chinese Library Classification number
O6 [Chemistry]
Discipline classification code
0703
Abstract
Recent studies in text-to-speech synthesis have shown the benefit of using a continuous pitch estimate, one that interpolates fundamental frequency (F0) even when voicing is not present. However, continuous F0 is still sensitive to additive noise in speech signals and suffers from short-term errors (when it changes rather quickly over time). To alleviate these issues, three adaptive techniques have been developed in this article for achieving a robust and accurate F0: (1) we weight the pitch estimates with state noise covariance using an adaptive Kalman-filter framework, (2) we iteratively apply time-axis warping to the input frame signal, and (3) we optimize all F0 candidates using an instantaneous-frequency-based approach. The second goal of this study is to introduce an extension of a novel continuous-based speech synthesis system (i.e., one in which all parameters are continuous). We propose adding a new excitation parameter, the Harmonic-to-Noise Ratio (HNR), to the voiced and unvoiced components to indicate the degree of voicing in the excitation and to reduce the buzziness caused by the vocoder. Results based on objective and perceptual tests demonstrate that the voice built with the proposed framework gives state-of-the-art speech synthesis performance while outperforming the previous baseline.
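To make technique (1) more concrete, the following is a minimal illustrative sketch of Kalman filtering applied to an F0 contour. It is not the authors' implementation: it assumes a scalar random-walk state model with fixed noise variances, whereas the abstract describes an adaptive scheme. The function name `kalman_smooth_f0` and the parameters `process_var` and `measurement_var` are hypothetical and chosen only for illustration.

```python
import numpy as np


def kalman_smooth_f0(f0, process_var=4.0, measurement_var=100.0):
    """Smooth an F0 contour (Hz) with a scalar random-walk Kalman filter.

    Assumption: the variances are fixed here; an adaptive variant would
    update them frame by frame.
    """
    f0 = np.asarray(f0, dtype=float)
    if f0.size == 0:
        return f0
    smoothed = np.empty_like(f0)
    x = f0[0]               # current state estimate (Hz)
    p = measurement_var     # variance of the state estimate
    smoothed[0] = x
    for t in range(1, f0.size):
        # Predict: a random walk keeps the state and grows its uncertainty.
        p = p + process_var
        # Update: blend the prediction and the new measurement via the gain.
        k = p / (p + measurement_var)
        x = x + k * (f0[t] - x)
        p = (1.0 - k) * p
        smoothed[t] = x
    return smoothed


if __name__ == "__main__":
    # Hypothetical contour with an octave-jump error at index 3.
    raw = [120.0, 121.0, 122.0, 240.0, 123.0, 124.0]
    print(kalman_smooth_f0(raw))
```

A larger `measurement_var` relative to `process_var` makes the filter trust individual pitch estimates less, which damps short-term errors at the cost of slower tracking of genuine pitch movement.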
Pages: 23