The contribution of various sources of spectral mismatch to audible discontinuities in a diphone database

Cited by: 6
Authors
Klabbers, Esther [1]
van Santen, Jan P. H. [1]
Kain, Alexander [1]
Affiliation
[1] Oregon Hlth & Sci Univ, Ctr Spoken Language Understanding, OGI Sch Sci & Engn, Beaverton, OR 97206 USA
Source
IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2007, Vol. 15, No. 3
Funding
U.S. National Science Foundation
Keywords
audible discontinuities; diphones; spectral distance measures; speech synthesis
DOI
10.1109/TASL.2006.885250
Chinese Library Classification number
O42 [Acoustics]
Discipline classification code
070206; 082403
Abstract
One of the major problems in concatenative synthesis is the occurrence of audible discontinuities between two successive concatenative units. Several studies have attempted to discover objective distance measures that predict the audibility of these discontinuities. In this paper, we investigate mid-vowel joins for three vowels with a range of post-vocalic consonant contexts typical of diphone databases. A first perceptual experiment uses a pairwise comparison procedure to find two subsets of unit combinations: those with and those without audible discontinuities. A second perceptual experiment uses these two subsets in a procedure where formant resynthesis is used to manipulate three sources of discontinuity separately: formant frequencies, formant bandwidths, and overall energy. Results show that mismatch in formant frequencies provides the largest contribution to audible discontinuity, followed by mismatch in overall energy.
Pages: 949-956
Page count: 8