Learning to Count: Robust Estimates for Labeled Distances between Molecular Sequences

Cited by: 79
Authors
O'Brien, John D. [1]
Minin, Vladimir N. [2]
Suchard, Marc A. [1, 3, 4]
Affiliations
[1] Univ Calif Los Angeles, Dept Biomath, Los Angeles, CA 90024 USA
[2] Univ Washington, Dept Stat, Seattle, WA 98195 USA
[3] Univ Calif Los Angeles, Dept Biostat, Los Angeles, CA USA
[4] Univ Calif Los Angeles, Dept Human Genet, Los Angeles, CA USA
Funding
National Institutes of Health (USA)
Keywords
robust counting; labeled codon distance; empirical distribution; Markov chain substitution model; NONSYNONYMOUS SUBSTITUTION RATES; NUCLEOTIDE SUBSTITUTION; DNA-SEQUENCES; CODING DNA; MODELS; EVOLUTION; SITES; SELECTION; ALIGNMENT; NUMBERS;
DOI
10.1093/molbev/msp003
Chinese Library Classification number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Subject classification codes
071010; 081704
Abstract
Researchers routinely estimate distances between molecular sequences using continuous-time Markov chain models. We present a new method, robust counting, that protects against the possibly severe bias arising from model misspecification. We achieve this robustness by generalizing the conventional distance estimation to incorporate the empirical distribution of site patterns found in the observed pairwise sequence alignment. Our flexible framework allows for computing distances based only on a subset of possible substitutions. From this, we show how to estimate labeled codon distances, such as expected numbers of synonymous or nonsynonymous substitutions. We present two simulation studies. The first compares the relative bias and variance of conventional and robust labeled nucleotide estimators. In the second simulation, we demonstrate that robust counting furnishes accurate synonymous and nonsynonymous distance estimates based only on easy-to-fit models of nucleotide substitution, bypassing the need for computationally expensive codon models. We conclude with three empirical examples. In the first two examples, we investigate the evolutionary dynamics of the influenza A hemagglutinin gene using labeled codon distances. In the final example, we demonstrate the advantages of using robust synonymous distances to alleviate the effect of convergent evolution on phylogenetic analysis of an HIV transmission network.
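
The contrast the abstract draws between conventional and robust labeled distances can be illustrated with a small sketch. It is not the authors' implementation. It assumes a rate matrix `q` and a time `t` have already been fitted (here a Jukes-Cantor matrix and t = 0.3 are placeholders), it labels transitions (A<->G, C<->T) as the subset of substitutions to count, and it uses a standard block-matrix-exponential identity to obtain expected labeled counts conditional on the two end states. All function names are hypothetical. The conventional estimate averages those counts over the model's own site-pattern distribution, while the robust estimate averages them over the empirical site-pattern frequencies of the pairwise alignment.

```python
import numpy as np

NUC = "ACGT"

def expm(a, squarings=10, terms=20):
    """Matrix exponential via scaling and squaring of a truncated Taylor series."""
    a = np.asarray(a, dtype=float) / (2 ** squarings)
    result = np.eye(a.shape[0])
    term = np.eye(a.shape[0])
    for k in range(1, terms + 1):
        term = term @ a / k
        result = result + term
    for _ in range(squarings):
        result = result @ result
    return result

def jc69_rate_matrix():
    """Placeholder model: Jukes-Cantor, scaled to one substitution per unit time."""
    q = np.full((4, 4), 1.0 / 3.0)
    np.fill_diagonal(q, -1.0)
    return q

def transition_mask():
    """Label set (an assumption for this sketch): A<->G and C<->T only."""
    mask = np.zeros((4, 4))
    for a, b in ("AG", "GA", "CT", "TC"):
        mask[NUC.index(a), NUC.index(b)] = 1.0
    return mask

def labeled_counts(q, label_mask, t):
    """Return P(t) and M, where M[i, j] = E[N_L * 1{X_t = j} | X_0 = i].

    The upper-right block of exp([[Q, Q_L], [0, Q]] * t) equals the
    integral over s in [0, t] of exp(Q s) Q_L exp(Q (t - s)).
    """
    n = q.shape[0]
    block = np.zeros((2 * n, 2 * n))
    block[:n, :n] = q
    block[:n, n:] = q * label_mask  # labeled off-diagonal rates only
    block[n:, n:] = q
    e = expm(block * t)
    return e[:n, :n], e[:n, n:]

def site_pattern_freqs(seq_a, seq_b):
    """Empirical frequencies of aligned nucleotide pairs; other symbols are skipped."""
    counts = np.zeros((4, 4))
    for a, b in zip(seq_a, seq_b):
        if a in NUC and b in NUC:
            counts[NUC.index(a), NUC.index(b)] += 1
    return counts / counts.sum()

def conventional_distance(q, label_mask, t, pi):
    """Model-based expected number of labeled substitutions per site."""
    return float(t * (pi[:, None] * q * label_mask).sum())

def robust_distance(seq_a, seq_b, q, label_mask, t):
    """Conditional expected labeled counts, averaged over observed site patterns."""
    p, m = labeled_counts(q, label_mask, t)
    return float((site_pattern_freqs(seq_a, seq_b) * (m / p)).sum())

if __name__ == "__main__":
    q, mask, t = jc69_rate_matrix(), transition_mask(), 0.3
    print(conventional_distance(q, mask, t, np.full(4, 0.25)))
    print(robust_distance("ACGTACGTAA", "ACGTGCGTAG", q, mask, t))
```

When the empirical site-pattern frequencies coincide with those implied by the fitted model, the two estimates agree; they diverge when the alignment departs from the model, which is the situation the abstract describes as misspecification.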
Pages: 801-814
Page count: 14