Analysis of speaker clustering strategies for HMM-based speech synthesis

Cited by: 0

Authors:

Dall, Rasmus [1]

Veaux, Christophe [1]

Yamagishi, Junichi [1]

King, Simon [1]

Affiliation:

[1] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh EH8 9YL, Midlothian, Scotland

Source:

13TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2012 (INTERSPEECH 2012), VOLS 1-3 | 2012

Keywords:

Statistical parametric speech synthesis; hidden Markov models; speaker adaptation

DOI:

Not available

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104; 0812; 0835; 1405

Abstract:

This paper describes a method for speaker clustering, applied to building average voice models for speaker-adaptive HMM-based speech synthesis that provide a good basis for adaptation to specific target speakers. Our main hypothesis is that using perceptually similar speakers to build the average voice model will be better than using unselected speakers, even if the amount of data available from perceptually similar speakers is smaller. We measure the perceived similarities among a group of 30 female speakers in a listening test and then apply multiple linear regression to predict these listener judgements of speaker similarity, and thus to identify similar speakers automatically. We then compare a variety of average voice models trained on speakers who were perceptually judged to be similar to the target speaker, on speakers selected by the multiple linear regression, or on a large global set of unselected speakers. We find that the average voice model trained on perceptually similar speakers provides better performance than the global model, even though the latter is trained on more data, confirming our main hypothesis. However, the average voice model using speakers selected automatically by the multiple linear regression does not reach the same level of performance.
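The automatic selection step described in the abstract, predicting listener similarity judgements with multiple linear regression and then picking the most similar speakers, can be illustrated with a minimal sketch. This is not the authors' implementation: the feature matrix, the synthetic scores, the function names, and the choice of `k` are all assumptions made for illustration, since the record does not say which predictors the paper uses.

```python
import numpy as np


def fit_linear_regression(features, scores):
    """Fit ordinary least squares with an intercept term.

    features: (n_speakers, n_features) array of hypothetical acoustic
              measurements relating each candidate to the target speaker.
    scores:   (n_speakers,) array of perceived similarity to the target.
    Returns the weight vector [intercept, w_1, ..., w_n].
    """
    design = np.column_stack([np.ones(len(features)), features])
    weights, *_ = np.linalg.lstsq(design, scores, rcond=None)
    return weights


def predict_similarity(weights, features):
    """Predict similarity scores for candidate speakers."""
    design = np.column_stack([np.ones(len(features)), features])
    return design @ weights


def select_similar_speakers(predicted, k):
    """Return indices of the k speakers with the highest predicted similarity."""
    return np.argsort(predicted)[::-1][:k]


# Synthetic, noise-free data (an assumption): 29 candidate speakers,
# 3 made-up features, and scores generated from known weights.
rng = np.random.default_rng(0)
features = rng.normal(size=(29, 3))
true_weights = np.array([0.5, -1.0, 2.0])
scores = 3.0 + features @ true_weights

weights = fit_linear_regression(features, scores)
predicted = predict_similarity(weights, features)
selected = select_similar_speakers(predicted, k=5)
```

In a real setting the listening-test judgements would be noisy and the regression would be evaluated on held-out speakers; the abstract reports that speakers chosen this way did not match the performance of speakers chosen from the perceptual judgements directly.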

Citation

Pages: 994-997

Page count: 4
