Unit-centric feature mapping for inventory pruning in unit selection text-to-speech synthesis

Cited by: 4
Author
Bellegarda, Jerome R. [1]
Affiliation
[1] Apple Inc, Speech & Language Technol, Cupertino, CA 95014 USA
Source
IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2008 / Vol. 16 / No. 01
Keywords
concatenative speech synthesis; inventory pruning; outlier removal; unit redundancy perception; unit selection;
DOI
10.1109/TASL.2007.911059
Chinese Library Classification number
O42 [Acoustics];
Subject classification code
070206 ; 082403 ;
Abstract
The level of quality that can be attained in concatenative text-to-speech (TTS) synthesis is primarily governed by the inventory of units used in unit selection. This has led to the collection of ever larger corpora in the quest for ever more natural synthetic speech. As operational considerations limit the size of the unit inventory, however, pruning is critical to removing any instances that prove either spurious or superfluous. This paper proposes a novel pruning strategy based on a data-driven feature extraction framework separately optimized for each unit type in the inventory. A single distinctiveness/redundancy measure can then address, in a consistent manner, the two different problems of outliers and redundant units. Detailed analysis of an illustrative case study exemplifies the typical behavior of the resulting unit pruning procedure, and listening evidence suggests that both moderate and aggressive inventory pruning can be achieved with minimal degradation in perceived TTS quality. These experiments underscore the benefits of unit-centric feature mapping for database optimization in concatenative synthesis.
Pages: 74-82
Page count: 9