Improved voicing decision using glottal activity features for statistical parametric speech synthesis

Cited by: 6

Authors:

Adiga, Nagaraj [1]

Khonglah, Banriskhem K. [1]

Prasanna, S. R. Mahadeva [1]

Affiliation:

[1] Indian Inst Technol Guwahati, Dept Elect & Elect Engn, Gauhati 781039, India

Source:

DIGITAL SIGNAL PROCESSING | 2017 / Vol. 71

Keywords:

Glottal activity features; Statistical parametric speech synthesis; Voicing decision; Support vector machine; EPOCH EXTRACTION; F0; CLASSIFICATION;

DOI:

10.1016/j.dsp.2017.09.007

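Chinese Library Classification (CLC) number:

TM [Electrical engineering]; TN [Electronic technology, communication technology];

Discipline classification code: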

0808 ; 0809 ;

Abstract:

A method to improve the voicing decision using glottal activity features is proposed for statistical parametric speech synthesis. In existing methods, the voicing decision relies mostly on the fundamental frequency (F0), which may result in errors when the prediction is inaccurate. Even though F0 is a glottal activity feature, other features that characterize this activity may help in improving the voicing decision. The glottal activity features used in this work are the strength of excitation (SoE), normalized autocorrelation peak strength (NAPS), and higher-order statistics (HOS). These features are obtained from approximated source signals such as the zero-frequency filtered signal and the integrated linear prediction residual. To improve the voicing decision and to avoid a heuristic threshold for classification, the glottal activity features are used to train different statistical learning methods, such as k-nearest neighbor, support vector machine (SVM), and deep belief network. The voicing decision works best with the SVM classifier, and its effectiveness is tested using statistical parametric speech synthesis. The glottal activity features SoE, NAPS, and HOS are modeled along with F0 and Mel-cepstral coefficients in hidden Markov model and deep neural network frameworks to get the voicing decision. The objective and subjective evaluations demonstrate that the proposed method improves the naturalness of synthetic speech. © 2017 Elsevier Inc. All rights reserved.
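As a rough illustration of one feature named in the abstract, the sketch below computes a normalized autocorrelation peak strength for a single frame. It is not the authors' implementation: the abstract says the features are derived from approximated source signals such as the integrated linear prediction residual, whereas this example is applied directly to synthetic waveforms for simplicity. The function name `naps`, the pitch search range (60 to 400 Hz), the 40 ms frame length, and the test signals are all assumptions made for illustration.

```python
import numpy as np

def naps(frame, fs, f0_min=60.0, f0_max=400.0):
    """Peak of the normalized autocorrelation within an assumed pitch-lag range."""
    frame = frame - np.mean(frame)
    energy = np.dot(frame, frame)
    if energy == 0.0:
        return 0.0
    # Full autocorrelation; keep the non-negative lags only.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    ac = ac / energy  # lag 0 equals 1 after normalization
    lag_min = int(fs / f0_max)
    lag_max = min(int(fs / f0_min), len(frame) - 1)
    return float(np.max(ac[lag_min:lag_max + 1]))

fs = 16000                           # assumed sampling rate
t = np.arange(int(0.04 * fs)) / fs   # assumed 40 ms frame
rng = np.random.default_rng(0)

# Hypothetical stand-ins for a voiced and an unvoiced frame.
voiced_like = np.sin(2 * np.pi * 120 * t) + 0.5 * np.sin(2 * np.pi * 240 * t)
unvoiced_like = rng.standard_normal(len(t))

naps_voiced = naps(voiced_like, fs)
naps_unvoiced = naps(unvoiced_like, fs)
```

The periodic frame yields a clearly higher peak than the noise frame. In the approach the abstract describes, such values would be passed together with SoE and HOS to a trained classifier such as an SVM, not compared against a hand-set threshold.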

References

Pages: 131-143

Number of pages: 13

47 entries in total

[21]

Janer L, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P1209, DOI 10.1109/ICSLP.1996.607825

[22]

Kang S., 2009, P INTERSPEECH, P412

[23] Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds [J].

Kawahara, H ;

Masuda-Katsuse, I ;

de Cheveigné, A .

SPEECH COMMUNICATION, 1999, 27 (3-4) :187-207

[24] Speech/music classification using speech-specific features [J].

Khonglah, Banriskhem K. ;

Prasanna, S. R. Mahadeva .

DIGITAL SIGNAL PROCESSING, 2016, 48 :71-83

[25] An introduction to statistical parametric speech synthesis [J].

King, Simon .

SADHANA-ACADEMY PROCEEDINGS IN ENGINEERING SCIENCES, 2011, 36 (05) :837-852

[26] Enhancement of noisy speech by temporal and spectral processing [J].

Krishnamoorthy, P. ;

Prasanna, S. R. M. .

SPEECH COMMUNICATION, 2011, 53 (02) :154-174

[27] 2-CHANNEL SPEECH ANALYSIS [J].

KRISHNAMURTHY, AK ;

CHILDERS, DG .

IEEE TRANSACTIONS ON ACOUSTICS SPEECH AND SIGNAL PROCESSING, 1986, 34 (04) :730-743

[28] Epoch Extraction From Speech Signals [J].

Murty, K. Sri Rama ;

Yegnanarayana, B. .

IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2008, 16 (08) :1602-1613

[29] Characterization of Glottal Activity From Speech Signals [J].

Murty, K. Sri Rama ;

Yegnanarayana, B. ;

Joseph, M. Anand .

IEEE SIGNAL PROCESSING LETTERS, 2009, 16 (06) :469-472

[30]

Narendra N., 2015, CIRCUITS SYST SIGNAL, P1
