Expectation-Maximization enables Phylogenetic Dating under a Categorical Rate Model

Times cited: 0
Authors
Mai, Uyen [1]
Charvel, Eduardo [2]
Mirarab, Siavash [3]
Affiliations
[1] Univ Calif San Diego, Dept Comp Sci & Engn, La Jolla, CA 92093 USA
[2] Univ Calif San Diego, Bioinformat & Syst Biol Grad Program, La Jolla, CA 92093 USA
[3] Univ Calif San Diego, Dept Elect & Comp Engn, La Jolla, CA 92093 USA
Funding
National Institutes of Health (USA);
Keywords
Categorical model; Expectation-Maximization algorithm; molecular dating; phylogenetic dating; time tree; ESTIMATING DIVERGENCE TIMES; MOLECULAR EVOLUTION; MAXIMUM-LIKELIHOOD; HIV-1; CLOCKS; AGE;
DOI
10.1093/sysbio/syae034
Chinese Library Classification number
Q [Biological Sciences];
Discipline classification code
07 ; 0710 ; 09 ;
Abstract
Dating phylogenetic trees to obtain branch lengths in time units is essential for many downstream applications but has remained challenging. Dating requires inferring substitution rates that can change across the tree. While we can assume to have information about a small subset of nodes from the fossil record or sampling times (for fast-evolving organisms), inferring the ages of the other nodes essentially requires extrapolation and interpolation. Assuming a distribution of branch rates, we can formulate dating as a constrained maximum likelihood (ML) estimation problem. While ML dating methods exist, their accuracy degrades in the face of model misspecification, where the assumed parametric statistical distribution of branch rates vastly differs from the true distribution. Notably, most existing methods assume rigid, often unimodal, branch rate distributions. A second challenge is that the likelihood function involves an integral over the continuous domain of the rates, often leading to difficult non-convex optimization problems. To tackle both challenges, we propose a new method called Molecular Dating using Categorical-models (MD-Cat). MD-Cat uses a categorical model of rates inspired by non-parametric statistics and can approximate a large family of models by discretizing the rate distribution into k categories. Under this model, we can use the Expectation-Maximization algorithm to co-estimate rate categories and branch lengths in time units. Our model has fewer assumptions about the true distribution of branch rates than parametric models such as Gamma or LogNormal distribution. Our results on two simulated and real datasets of Angiosperms and HIV and a wide selection of rate distributions show that MD-Cat is often more accurate than the alternatives, especially on datasets with exponential or multimodal rate distributions.
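The Expectation-Maximization idea in the abstract can be illustrated with a minimal sketch. This is not the MD-Cat implementation. It assumes k equally weighted rate categories, Gaussian noise with a known standard deviation `sigma` on each observed branch length, and branch durations `tau` that are held fixed, whereas the abstract describes co-estimating rate categories and branch lengths in time units. All function and variable names are hypothetical.

```python
import math

def e_step(b, tau, omega, sigma):
    """Posterior probability of each rate category for each branch.

    Assumes b[i] ~ Normal(omega[j] * tau[i], sigma^2) and a uniform
    prior of 1/k over the k categories (an assumption of this sketch).
    """
    post = []
    for b_i, t_i in zip(b, tau):
        logw = [-0.5 * ((b_i - w * t_i) / sigma) ** 2 for w in omega]
        m = max(logw)  # subtract the max for numerical stability
        w_exp = [math.exp(x - m) for x in logw]
        s = sum(w_exp)
        post.append([x / s for x in w_exp])
    return post

def m_step_rates(b, tau, post, omega):
    """Weighted least-squares update of each category rate, times fixed."""
    new_omega = []
    for j, old in enumerate(omega):
        num = sum(post[i][j] * b[i] * tau[i] for i in range(len(b)))
        den = sum(post[i][j] * tau[i] ** 2 for i in range(len(b)))
        # Keep the previous rate if no branch supports this category.
        new_omega.append(num / den if den > 0 else old)
    return new_omega

def fit_rates(b, tau, omega, sigma, iters=20):
    """Alternate E and M steps for a fixed number of iterations."""
    for _ in range(iters):
        post = e_step(b, tau, omega, sigma)
        omega = m_step_rates(b, tau, post, omega)
    return omega

# Toy data: four branches of unit duration drawn from two rate regimes.
b = [0.10, 0.11, 0.50, 0.52]    # branch lengths in substitutions per site
tau = [1.0, 1.0, 1.0, 1.0]      # branch durations in time units (fixed here)
rates = fit_rates(b, tau, omega=[0.2, 0.4], sigma=0.05)
print(rates)
```

On this toy input the two category rates separate toward the slow and fast groups of branches. A dating method would also have to update `tau` in the M-step subject to the calibration constraints, which this sketch leaves out.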
Pages: 823-838
Page count: 16