Analysis of speaker clustering strategies for HMM-based speech synthesis

Cited by: 0
Authors
Dall, Rasmus [1]
Veaux, Christophe [1]
Yamagishi, Junichi [1]
King, Simon [1]
Affiliation
[1] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh EH8 9YL, Midlothian, Scotland
Source
13TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2012 (INTERSPEECH 2012), VOLS 1-3 | 2012
Keywords
Statistical parametric speech synthesis; hidden Markov models; speaker adaptation;
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper describes a method for speaker clustering, with the application of building average voice models for speaker-adaptive HMM-based speech synthesis that are a good basis for adapting to specific target speakers. Our main hypothesis is that using perceptually similar speakers to build the average voice model will be better than using unselected speakers, even if the amount of data available from perceptually similar speakers is smaller. We measure the perceived similarities among a group of 30 female speakers in a listening test and then apply multiple linear regression to predict these listener judgements of speaker similarity and thus to identify similar speakers automatically. We then compare a variety of average voice models trained on either speakers who were perceptually judged to be similar to the target speaker, or speakers selected by the multiple linear regression, or a large global set of unselected speakers. We find that the average voice model trained on perceptually similar speakers provides better performance than the global model, even though the latter is trained on more data, confirming our main hypothesis. However, the average voice model using speakers selected automatically by the multiple linear regression does not reach the same level of performance.
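The regression-based selection step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state which predictors were used, so the three feature columns, the synthetic similarity scores, the choice of k = 5, and all function names below are assumptions made purely for illustration.

```python
import numpy as np

def fit_similarity_regression(features, scores):
    """Ordinary least-squares fit of scores ~ intercept + features @ weights."""
    X = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(X, scores, rcond=None)
    return coef

def predict_similarity(coef, features):
    """Predicted similarity for each candidate speaker."""
    X = np.column_stack([np.ones(len(features)), features])
    return X @ coef

def select_similar_speakers(predicted, k):
    """Indices of the k candidates with the highest predicted similarity."""
    return np.argsort(predicted)[::-1][:k]

# Synthetic stand-in data (assumption): 29 candidate speakers described by
# 3 hypothetical acoustic-distance features relative to one target speaker.
rng = np.random.default_rng(0)
features = rng.random((29, 3))

# Hypothetical "listener" scores: larger distances mean lower similarity.
true_weights = np.array([-2.0, -1.0, -0.5])
scores = 5.0 + features @ true_weights

coef = fit_similarity_regression(features, scores)
predicted = predict_similarity(coef, features)
selected = select_similar_speakers(predicted, k=5)
print("speakers selected for the average voice model:", selected)
```

In a real setting the scores would come from a listening test and the fitted model would be applied to speakers without perceptual ratings; the abstract reports that selection of this kind did not match the performance of perceptually selected speakers.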
Pages: 994 / 997
Page count: 4
Related papers
17 in total
  • [1] Andraszewicz S, 2011, INT CONF ACOUST SPEE, P5368
  • [2] [Anonymous], 2001, PROC MAVEBA
  • [3] Voice Loudness and Gender Effects on Jitter and Shimmer in Healthy Adults
    Brockmann, Meike
    Storck, Claudio
    Carding, Paul N.
    Drinnan, Michael J.
    [J]. JOURNAL OF SPEECH LANGUAGE AND HEARING RESEARCH, 2008, 51 (05) : 1152 - 1160
  • [4] Vocal Attractiveness Increases by Averaging
    Bruckert, Laetitia
    Bestelmeyer, Patricia
    Latinus, Marianne
    Rouger, Julien
    Charest, Ian
    Rousselet, Guillaume A.
    Kawahara, Hideki
    Belin, Pascal
    [J]. CURRENT BIOLOGY, 2010, 20 (02) : 116 - 120
  • [5] Dejonckere P H, 1996, Rev Laryngol Otol Rhinol (Bord), V117, P219
  • [6] Harmonics-to-noise ratio: An index of vocal aging
    Ferrand, CT
    [J]. JOURNAL OF VOICE, 2002, 16 (04) : 480 - 487
  • [7] Vocal intensity characteristics in normal and elderly speakers
    Hodge, FS
    Colton, RH
    Kelley, RT
    [J]. JOURNAL OF VOICE, 2001, 15 (04) : 503 - 511
  • [8] Ijima Y., 2011, P INTERSPEECH 2011 A, V2011, P2237
  • [9] Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds
    Kawahara, H
    Masuda-Katsuse, I
    de Cheveigné, A
    [J]. SPEECH COMMUNICATION, 1999, 27 (3-4) : 187 - 207
  • [10] The contribution of changes in F0 and spectral tilt to increased intelligibility of speech produced in noise
    Lu, Youyi
    Cooke, Martin
    [J]. SPEECH COMMUNICATION, 2009, 51 (12) : 1253 - 1262