HMM adaptation and voice conversion for the synthesis of child speech: a comparison

Cited by: 0

Authors:

Watts, Oliver [1]

Yamagishi, Junichi [1]

King, Simon [1]

Berkling, Kay [2]

Affiliations:

[1] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh EH8 9YL, Midlothian, Scotland

[2] Inline Internet Online Dienste GmbH, Karlsruhe, Germany

Source:

INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5 | 2009

Funding:

UK Engineering and Physical Sciences Research Council (EPSRC)

Keywords:

child speech; statistical parametric speech synthesis; HMM-based speech synthesis; voice conversion; HTS; Average Voice Models; Festival

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Subject classification codes:

081104; 0812; 0835; 1405

Abstract:

This study compares two different methodologies for producing data-driven synthesis of child speech from existing systems that have been trained on the speech of adults. On one hand, an existing statistical parametric synthesiser is transformed using model adaptation techniques, informed by linguistic and prosodic knowledge, to the speaker characteristics of a child speaker. This is compared with the application of voice conversion techniques to convert the output of an existing waveform concatenation synthesiser with no explicit linguistic or prosodic knowledge. In a subjective evaluation of the similarity of synthetic speech to natural speech from the target speaker, the HMM-based systems evaluated are generally preferred, although this is at least in part due to the higher-dimensional acoustic features supported by these techniques.


Pages: 2595 / +

Number of pages: 2
