Improved voicing decision using glottal activity features for statistical parametric speech synthesis

Cited by: 6
Authors
Adiga, Nagaraj [1 ]
Khonglah, Banriskhem K. [1 ]
Prasanna, S. R. Mahadeva [1 ]
Affiliations
[1] Indian Inst Technol Guwahati, Dept Elect & Elect Engn, Gauhati 781039, India
Keywords
Glottal activity features; Statistical parametric speech synthesis; Voicing decision; Support vector machine; EPOCH EXTRACTION; F0; CLASSIFICATION;
DOI
10.1016/j.dsp.2017.09.007
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronics and Communication Technology];
Discipline codes
0808 ; 0809 ;
Abstract
A method to improve the voicing decision using glottal activity features is proposed for statistical parametric speech synthesis. In existing methods, the voicing decision relies mostly on the fundamental frequency F0, which may result in errors when its prediction is inaccurate. Even though F0 is a glottal activity feature, other features that characterize this activity may help improve the voicing decision. The glottal activity features used in this work are the strength of excitation (SoE), normalized autocorrelation peak strength (NAPS), and higher-order statistics (HOS). These features are obtained from approximated source signals such as the zero-frequency filtered signal and the integrated linear prediction residual. To improve the voicing decision and avoid heuristic thresholds for classification, the glottal activity features are trained with different statistical learning methods: the k-nearest neighbor, support vector machine (SVM), and deep belief network. The voicing decision works best with the SVM classifier, and its effectiveness is tested in statistical parametric speech synthesis, where SoE, NAPS, and HOS are modeled along with F0 and Mel-cepstral coefficients in hidden Markov model and deep neural network frameworks to obtain the voicing decision. Objective and subjective evaluations demonstrate that the proposed method improves the naturalness of synthetic speech. (C) 2017 Elsevier Inc. All rights reserved.
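As a rough illustration of one of the features named in the abstract, the sketch below computes a per-frame normalized autocorrelation peak strength: the largest autocorrelation peak, normalized by the zero-lag value, searched within a plausible pitch-lag range. This is a hypothetical re-implementation based only on the feature's name and common practice, not the authors' exact definition; the function name `naps` and the pitch-range parameters are assumptions.

```python
import numpy as np

def naps(frame, fs=16000, f0_min=60.0, f0_max=400.0):
    """Normalized autocorrelation peak strength (NAPS) of one frame.

    Illustrative sketch only (not the paper's exact formulation):
    the strongest normalized autocorrelation peak within the plausible
    pitch-lag range. Values near 1 indicate strong periodicity (voiced),
    values near 0 indicate aperiodicity (unvoiced).
    """
    frame = np.asarray(frame, dtype=float)
    frame = frame - np.mean(frame)
    # One-sided autocorrelation, normalized by the zero-lag energy.
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    if r[0] <= 0:
        return 0.0
    r = r / r[0]
    # Search only lags corresponding to plausible pitch periods.
    lo = int(fs / f0_max)
    hi = min(int(fs / f0_min), len(r) - 1)
    return float(np.max(r[lo:hi + 1]))

# A periodic (voiced-like) frame should score well above a noise frame.
fs = 16000
t = np.arange(0, 0.032, 1 / fs)              # 32 ms frame
voiced = np.sin(2 * np.pi * 100 * t)          # 100 Hz periodic signal
unvoiced = np.random.default_rng(0).standard_normal(len(t))
print(naps(voiced, fs), naps(unvoiced, fs))   # periodic frame scores higher
```

In the paper such features are not thresholded directly; they are fed, together with SoE and HOS, to a trained classifier (SVM performing best), which avoids hand-picked decision thresholds.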
Pages: 131-143
Number of pages: 13
References
47 in total
[11]   A tutorial on Support Vector Machines for pattern recognition [J].
Burges, CJC .
DATA MINING AND KNOWLEDGE DISCOVERY, 1998, 2 (02) :121-167
[12]   Voicing detection based on adaptive aperiodicity thresholding for speech enhancement in non-stationary noise [J].
Cabanas-Molero, Pablo ;
Martinez-Munoz, Damian ;
Vera-Candeas, Pedro ;
Ruiz-Reyes, Nicolas ;
Jose Rodriguez-Serrano, Francisco .
IET SIGNAL PROCESSING, 2014, 8 (02) :119-130
[13]  
Camacho A., 2007, THESIS
[14]   LIBSVM: A Library for Support Vector Machines [J].
Chang, Chih-Chung ;
Lin, Chih-Jen .
ACM TRANSACTIONS ON INTELLIGENT SYSTEMS AND TECHNOLOGY, 2011, 2 (03)
[15]  
Cortes C., 2010, Proceedings of the 27th International Conference on Machine Learning, P239
[16]   YIN, a fundamental frequency estimator for speech and music [J].
de Cheveigné, A ;
Kawahara, H .
JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 2002, 111 (04) :1917-1930
[17]   Voiced/Nonvoiced Detection Based on Robustness of Voiced Epochs [J].
Dhananjaya, N. ;
Yegnanarayana, B. .
IEEE SIGNAL PROCESSING LETTERS, 2010, 17 (03) :273-276
[18]   A branch and bound algorithm for computing k-nearest neighbors [J].
Fukunaga, K ;
Narendra, PM .
IEEE TRANSACTIONS ON COMPUTERS, 1975, C-24 (07) :750-753
[19]  
Hinton G. E., 2010, Momentum, P599
[20]  
Hunt AJ, 1996, INT CONF ACOUST SPEE, P373, DOI 10.1109/ICASSP.1996.541110