Bayesian networks for phone duration prediction

Cited by: 23
Authors
Goubanova, Olga [1]
King, Simon [1]
Affiliation
[1] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh EH8 9LW, Midlothian, Scotland
Funding
UK Engineering and Physical Sciences Research Council (EPSRC);
Keywords
text-to-speech; Bayesian networks; duration modelling; sums of products; classification and regression trees;
DOI
10.1016/j.specom.2007.10.002
Chinese Library Classification number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
In a text-to-speech system, the duration of each phone may be predicted by a duration model. This model is usually trained using a database of phones with known durations; each phone (and the context it appears in) is characterised by a feature vector composed of a set of linguistic factor values. We describe the use of a graphical model, a Bayesian network (BN), for predicting the duration of a phone given the values of these factors. The network has one discrete variable for each of the linguistic factors and a single continuous variable for the phone's duration. Dependencies between variables (or the lack of them) are represented in the BN structure by arcs (or missing arcs) between pairs of nodes. During training, both the topology of the network and its parameters are learned from labelled data. We compare the results of the BN model with results for sums of products (SoP) and classification and regression tree (CART) models on the same data. In terms of root mean square error, the BN model performs much better than both the CART and SoP models. In terms of correlation coefficient, the BN model performs better than the SoP model and as well as the CART model. A BN model has certain advantages over CART and SoP models: training SoP models requires a high degree of expertise, and CART models do not deal with interactions between factors in any explicit way. As we demonstrate, a BN model can also make accurate predictions of a phone's duration even when the values of some of the linguistic factors are unknown. © 2007 Elsevier B.V. All rights reserved.
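The idea of predicting a continuous duration from discrete factors, including when some factors are unknown, can be illustrated with a minimal Python sketch. This is not the authors' model: it assumes a fixed structure in which every factor is a parent of the duration node, stores only a mean per parent configuration (standing in for a conditional Gaussian), and omits structure learning entirely. The factor names, the toy durations, and the helper names `fit`, `predict`, `rmse`, and `correlation` are all hypothetical.

```python
from collections import defaultdict
from math import sqrt

def fit(rows):
    """Estimate count and mean duration for each factor configuration."""
    sums, counts = defaultdict(float), defaultdict(int)
    for factors, duration in rows:
        sums[factors] += duration
        counts[factors] += 1
    means = {cfg: sums[cfg] / counts[cfg] for cfg in counts}
    return dict(counts), means

def predict(counts, means, observed):
    """Expected duration given observed factors; None marks an unknown factor.

    Unknown factors are marginalised out: each matching configuration's mean
    is weighted by its empirical frequency among the matching configurations.
    """
    num, den = 0.0, 0
    for cfg, n in counts.items():
        if all(o is None or o == c for o, c in zip(observed, cfg)):
            num += n * means[cfg]
            den += n
    if den == 0:
        raise ValueError("no training configuration matches the observed factors")
    return num / den

def rmse(pred, true):
    """Root mean square error between predictions and reference values."""
    return sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))

def correlation(xs, ys):
    """Pearson correlation coefficient."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical toy data: ((stress, position), duration in ms). Not from the paper.
rows = [
    (("stressed", "final"), 120.0),
    (("stressed", "final"), 100.0),
    (("stressed", "medial"), 80.0),
    (("unstressed", "final"), 70.0),
    (("unstressed", "medial"), 50.0),
    (("unstressed", "medial"), 60.0),
]
counts, means = fit(rows)
full = predict(counts, means, ("stressed", "final"))   # all factors known
partial = predict(counts, means, ("stressed", None))   # position unknown
```

With the position factor unknown, the prediction is a frequency-weighted average over the configurations consistent with the known stress value. A learned network with a sparser structure would do the same marginalisation, but over far fewer parameters than this fully tabulated sketch.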
Pages: 301-311
Page count: 11