Synthesis of prosody using multi-level unit sequences

Cited by: 12
Authors
van Santen, J [1]
Kain, A [1]
Klabbers, E [1]
Mishra, T [1]
Affiliations
[1] Oregon Hlth & Sci Univ, OGI Sch Sci & Engn, Ctr Spoken Language Understanding, Portland, OR 97201 USA
DOI
10.1016/j.specom.2005.01.008
Chinese Library Classification number
O42 [Acoustics];
Discipline classification codes
070206 ; 082403 ;
Abstract
Generating meaningful and natural-sounding prosody is a central challenge in text-to-speech synthesis (TTS). In traditional synthesis, the challenge consists of how to generate natural target prosodic contours and how to impose these contours on recorded speech without causing audible distortions. In unit selection synthesis, the challenge is the sheer size of the speech corpus that is needed to cover all combinations of phone sequences and prosodic contexts that can occur in a given language. This paper describes new methods that are being explored, based on the principle of superpositional prosody transplant. Both methods are based on the following procedure. In a recorded, prosodically and phonemically labeled corpus, the log pitch contours are additively decomposed into component curves according to a prosodic hierarchy, typically phrase curves (corresponding to phrases), accent curves (corresponding to feet), and segmental perturbation (or residual) curves. During synthesis, the corpus is searched for multiple unit sequences: a unit sequence that covers the target phoneme string, and one or more unit sequences that cover the prosodic labels at a given phonological level (e.g., the foot or phrase) and are constrained to match the phone-matching sequence in terms of the phonetic classes of the phonemes (or in terms of higher-level entities, such as the number of feet and their sizes measured in syllables). The methods differ in the level of detail of these constraints. A superpositional prosody transplant procedure generates a target pitch contour by extracting and recombining component curves from these sequences, and imposes this contour on the sequence that matches the phone string using standard speech modification methods. This process minimizes prosodic modification artifacts and optimizes the naturalness of the target pitch contour, yet avoids the combinatorial explosion of standard unit selection synthesis. © 2005 Elsevier B.V. All rights reserved.
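The additive recombination step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `resample` and `transplant` are hypothetical, linear interpolation is assumed as a simple stand-in for whatever time alignment the paper uses, and the curves are assumed to be lists of log-F0 values already extracted from the donor unit sequences.

```python
import math

def resample(curve, n):
    """Linearly interpolate `curve` to `n` samples.

    Assumption: plain linear time scaling stands in for the paper's
    alignment of donor curves to the target timing.
    """
    if n == 1 or len(curve) == 1:
        return [curve[0]] * n
    out = []
    for i in range(n):
        pos = i * (len(curve) - 1) / (n - 1)   # fractional index into curve
        lo = int(math.floor(pos))
        hi = min(lo + 1, len(curve) - 1)
        frac = pos - lo
        out.append(curve[lo] * (1 - frac) + curve[hi] * frac)
    return out

def transplant(phrase_curve, accent_curve, residual_curve, n_frames):
    """Sum component curves in the log-F0 domain and return F0 in Hz.

    Each argument may come from a different donor sequence; the sum is
    the target log pitch contour for a stretch of `n_frames` frames.
    """
    phrase = resample(phrase_curve, n_frames)
    accent = resample(accent_curve, n_frames)
    residual = resample(residual_curve, n_frames)
    return [math.exp(p + a + r) for p, a, r in zip(phrase, accent, residual)]

# Hypothetical example: a flat 100 Hz phrase curve, a rise-fall accent
# curve, and a zero residual, rendered over 5 frames.
f0 = transplant([math.log(100.0)] * 2, [0.0, 0.2, 0.0], [0.0], 5)
```

Because the components are summed in the log domain, an accent curve value of 0.2 raises the pitch by a factor of exp(0.2) regardless of the phrase-level baseline, which is what makes it possible to combine curves taken from different sequences.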
Pages: 365-375
Number of pages: 11