Articulatory Control of HMM-Based Parametric Speech Synthesis Using Feature-Space-Switched Multiple Regression

Cited by: 35
Authors
Ling, Zhen-Hua [1 ]
Richmond, Korin [2 ]
Yamagishi, Junichi [2 ]
Affiliations
[1] Univ Sci & Technol China, iFLYTEK Speech Lab, Hefei 230027, Peoples R China
[2] Univ Edinburgh, CSTR, Edinburgh EH8 9AB, Midlothian, Scotland
Source
IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING | 2013 / Vol. 21 / No. 1
Funding
National Natural Science Foundation of China; UK Engineering and Physical Sciences Research Council (EPSRC);
Keywords
Articulatory features; Gaussian mixture model; multiple-regression hidden Markov model; speech synthesis; MOVEMENTS; ADAPTATION; EXTRACTION; TRACKING;
DOI
10.1109/TASL.2012.2215600
Chinese Library Classification (CLC)
O42 [Acoustics];
Subject Classification Codes
070206; 082403;
Abstract
In previous work, we proposed a method to control the characteristics of synthetic speech flexibly by integrating articulatory features into a hidden Markov model (HMM) based parametric speech synthesizer. In that method, a unified acoustic-articulatory model is trained, and context-dependent linear transforms are used to model the dependency between the two feature streams. In this paper, we go significantly further and propose a feature-space-switched multiple regression HMM to improve the performance of articulatory control. A multiple regression HMM (MRHMM) is adopted to model the distribution of acoustic features, with articulatory features used as exogenous "explanatory" variables. A separate Gaussian mixture model (GMM) is introduced to model the articulatory space, and articulatory-to-acoustic regression matrices are trained for each component of this GMM, instead of for each context-dependent state in the HMM. Furthermore, we propose a task-specific context feature tailoring method to ensure compatibility between the state context features and the articulatory features that are manipulated at synthesis time. The proposed method is evaluated on two tasks, using a speech database in which acoustic waveforms and articulatory movements were recorded in parallel by electromagnetic articulography (EMA). In a vowel identity modification task, the new method achieves better performance than our previous approach when reconstructing target vowels by varying the articulatory inputs. A second vowel creation task shows that the new method is highly effective at producing a new vowel from appropriate articulatory representations: even though no acoustic samples of this vowel are present in the training data, the synthesized vowel is shown to sound highly natural.
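To make the feature-space switching concrete, below is a minimal numerical sketch in Python of the mechanism the abstract describes: a GMM over the articulatory space assigns posterior responsibilities to an input articulatory vector, and the acoustic state mean is shifted by the responsibility-weighted sum of per-component regressions. All dimensions, variable names, and parameter values here are hypothetical illustrations, not the authors' trained model.

import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Toy dimensions (assumed for illustration): a 6-D articulatory vector
# (e.g., EMA coil coordinates), a 13-D acoustic vector, 4 GMM components.
D_ART, D_AC, M = 6, 13, 4

# GMM over the articulatory space: weights, means, covariances.
gmm_w = np.full(M, 1.0 / M)
gmm_mu = rng.normal(size=(M, D_ART))
gmm_cov = np.stack([np.eye(D_ART)] * M)

# One articulatory-to-acoustic regression matrix per GMM component
# (trained in the paper; random placeholders here).
A = rng.normal(scale=0.1, size=(M, D_AC, D_ART))

def gmm_posteriors(x):
    # p(m | x): responsibility of each articulatory-space component for x.
    lik = np.array([w * multivariate_normal.pdf(x, mu, cov)
                    for w, mu, cov in zip(gmm_w, gmm_mu, gmm_cov)])
    return lik / lik.sum()

def switched_state_mean(x, state_bias):
    # Acoustic mean = state-dependent bias + posterior-weighted regression
    # of the articulatory input; varying x moves the acoustic mean, which
    # is how articulatory control is exercised at synthesis time.
    gamma = gmm_posteriors(x)
    shift = sum(g * A_m @ x for g, A_m in zip(gamma, A))
    return state_bias + shift

x = rng.normal(size=D_ART)            # articulatory features for one frame
mu = switched_state_mean(x, np.zeros(D_AC))
print(mu.shape)                        # (13,)

Tying the regression matrices to regions of articulatory space, rather than to context-dependent HMM states, is plausibly what allows unseen articulatory configurations (such as the created vowel) to map to sensible acoustics.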
Pages: 205-217
Page count: 13