Measuring the fit of sequence data to phylogenetic model: Allowing for missing data

Cited by: 14
Authors
Waddell, PJ [1]
Affiliations
[1] Univ S Carolina, Dept Biol Sci, Dept Stat, Columbia, SC 29208 USA
Keywords
phylogenetic likelihood-ratio test; model fit; ML with unequal site rates; G statistic; Hadamard conjugation; spectral analysis
DOI
10.1093/molbev/msi002
Chinese Library Classification number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010; 081704
Abstract
It is fundamentally important to assess the fit of data to model in phylogenetic and evolutionary studies. Phylogenetic methods using molecular sequences typically start with a multiple alignment. It is possible to measure the fit of data to model expectations of data, for example, via the likelihood-ratio (G) test or the χ² test, if all sites in all sequences have an unambiguous residue. However, nearly all alignments of interest contain sites (columns of the alignment) with missing data, that is, ambiguous nucleotides, gaps, or unsequenced regions, which must presently be removed before using the above tests. Unfortunately, this is often either undesirable or impractical, as it will discard much of the data. Here, we show how iterative ML estimators may directly estimate the site-pattern probabilities for columns with missing data, given only standard i.i.d. assumptions. The optimization may use an EM or Newton algorithm, or any other hill-climbing approach. The resulting optimal likelihood under the unconstrained or multinomial model may be compared directly with the likelihood of the data coming from the model (a G statistic). Alternatively, the modified observed and the expected frequencies of site patterns may be compared using a χ² test. The distribution of such statistics is best assessed using appropriate simulations. The new method is applicable to models using codons or paired sites. The methods are also useful with Hadamard conjugations (spectral analysis) and are illustrated with these and with ML evolutionary models that allow site-rate variability.
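The idea in the abstract can be illustrated with a minimal sketch; it is not the paper's implementation. It assumes a two-state alphabet, `?` as the missing-data symbol, data missing at random, and a made-up set of model-expected pattern probabilities (`model_probs`). All function names are hypothetical. The EM loop estimates the unconstrained (multinomial) site-pattern probabilities by sharing each incomplete column among the complete patterns it is compatible with, and G is then twice the log-likelihood difference between the unconstrained fit and the model.

```python
from collections import Counter
from itertools import product
from math import log


def compatible(observed, full, missing="?"):
    """True if a complete pattern agrees with an observed column at every non-missing position."""
    return all(o == missing or o == f for o, f in zip(observed, full))


def log_likelihood(counts, probs, missing="?"):
    """Log-likelihood of the observed columns, summing probs over compatible complete patterns."""
    total = 0.0
    for obs, n in counts.items():
        marginal = sum(p for full, p in probs.items() if compatible(obs, full, missing))
        total += n * log(marginal)
    return total


def em_pattern_probs(columns, alphabet="01", missing="?", max_iter=500, tol=1e-12):
    """EM estimate of multinomial site-pattern probabilities from columns with missing data."""
    counts = Counter(columns)
    n_sites = len(columns)
    n_taxa = len(columns[0])
    patterns = ["".join(s) for s in product(alphabet, repeat=n_taxa)]
    probs = {f: 1.0 / len(patterns) for f in patterns}  # uniform starting point
    for _ in range(max_iter):
        # E-step: split each observed count across its compatible complete patterns.
        expected = dict.fromkeys(patterns, 0.0)
        for obs, n in counts.items():
            comp = [f for f in patterns if compatible(obs, f, missing)]
            denom = sum(probs[f] for f in comp)
            for f in comp:
                expected[f] += n * probs[f] / denom
        # M-step: renormalise the expected counts into probabilities.
        updated = {f: expected[f] / n_sites for f in patterns}
        change = max(abs(updated[f] - probs[f]) for f in patterns)
        probs = updated
        if change < tol:
            break
    return probs, counts


def g_statistic(columns, model_probs, **kwargs):
    """G = 2 * (lnL under the unconstrained model - lnL under the supplied model)."""
    probs, counts = em_pattern_probs(columns, **kwargs)
    return 2.0 * (log_likelihood(counts, probs) - log_likelihood(counts, model_probs))


# Toy alignment of two taxa and four sites; the last column has a missing state.
columns = ["00", "00", "11", "0?"]
# Hypothetical model expectations, used for illustration only.
model_probs = {"00": 0.4, "01": 0.1, "10": 0.1, "11": 0.4}
estimated, _ = em_pattern_probs(columns)
G = g_statistic(columns, model_probs)
```

In this toy case the incomplete column `0?` is attributed almost entirely to pattern `00`, because no complete column supports `01`. As the abstract notes, the null distribution of such a statistic is best obtained by simulation rather than assumed.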
Pages: 395-401
Page count: 7
Related papers
27 in total
[1]  
ADACHI J, 1999, MOLPHY VERSION 2 3 M
[2]  
[Anonymous], 1971, STAT DECISION THEORY
[3]   MAXIMUM LIKELIHOOD FROM INCOMPLETE DATA VIA EM ALGORITHM [J].
DEMPSTER, AP ;
LAIRD, NM ;
RUBIN, DB .
JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-METHODOLOGICAL, 1977, 39 (01) :1-38
[4]   CASES IN WHICH PARSIMONY OR COMPATIBILITY METHODS WILL BE POSITIVELY MISLEADING [J].
FELSENSTEIN, J .
SYSTEMATIC ZOOLOGY, 1978, 27 (04) :401-410
[5]   EVOLUTIONARY TREES FROM DNA-SEQUENCES - A MAXIMUM-LIKELIHOOD APPROACH [J].
FELSENSTEIN, J .
JOURNAL OF MOLECULAR EVOLUTION, 1981, 17 (06) :368-376
[6]   STATISTICAL TESTS OF MODELS OF DNA SUBSTITUTION [J].
GOLDMAN, N .
JOURNAL OF MOLECULAR EVOLUTION, 1993, 36 (02) :182-198
[7]   DATING OF THE HUMAN APE SPLITTING BY A MOLECULAR CLOCK OF MITOCHONDRIAL-DNA [J].
HASEGAWA, M ;
KISHINO, H ;
YANO, TA .
JOURNAL OF MOLECULAR EVOLUTION, 1985, 22 (02) :160-174
[8]   A DISCRETE FOURIER-ANALYSIS FOR EVOLUTIONARY TREES [J].
HENDY, MD ;
PENNY, D ;
STEEL, MA .
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 1994, 91 (08) :3339-3343
[9]   SPECTRAL-ANALYSIS OF PHYLOGENETIC DATA [J].
HENDY, MD .
JOURNAL OF CLASSIFICATION, 1993, 10 (01) :5-24
[10]   A FRAMEWORK FOR THE QUANTITATIVE STUDY OF EVOLUTIONARY TREES [J].
HENDY, MD ;
PENNY, D .
SYSTEMATIC ZOOLOGY, 1989, 38 (04) :297-309