Weighted Statistical Binning: Enabling Statistically Consistent Genome-Scale Phylogenetic Analyses

Times cited: 74
Authors
Bayzid, Md Shamsuzzoha [1]
Mirarab, Siavash [1]
Boussau, Bastien [2]
Warnow, Tandy [3]
Affiliations
[1] Univ Texas Austin, Dept Comp Sci, Austin, TX 78712 USA
[2] Univ Lyons, Lab Biometrie & Biol Evolut, Lyon, France
[3] Univ Illinois, Dept Comp Sci, Urbana, IL 61801 USA
Source
PLOS ONE | 2015 / Volume 10 / Issue 06
Funding
U.S. National Science Foundation;
Keywords
INFERRING SPECIES TREES; GENE TREES; SEQUENCE ALIGNMENTS; MAXIMUM-LIKELIHOOD; BAYESIAN-INFERENCE; PHYLOGENOMICS;
DOI
10.1371/journal.pone.0129183
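Chinese Library Classification (CLC) codes
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];
Subject classification codes
07; 0710; 09;
Abstract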
Because biological processes can result in different loci having different evolutionary histories, species tree estimation requires multiple loci from across multiple genomes. While many processes can result in discord between gene trees and species trees, incomplete lineage sorting (ILS), modeled by the multi-species coalescent, is considered to be a dominant cause for gene tree heterogeneity. Coalescent-based methods have been developed to estimate species trees, many of which operate by combining estimated gene trees, and so are called "summary methods". Because summary methods are generally fast (and much faster than more complicated coalescent-based methods that co-estimate gene trees and species trees), they have become very popular techniques for estimating species trees from multiple loci. However, recent studies have established that summary methods can have reduced accuracy in the presence of gene tree estimation error, and also that many biological datasets have substantial gene tree estimation error, so that summary methods may not be highly accurate in biologically realistic conditions. Mirarab et al. (Science 2014) presented the "statistical binning" technique to improve gene tree estimation in multi-locus analyses, and showed that it improved the accuracy of MP-EST, one of the most popular coalescent-based summary methods. Statistical binning, which uses a simple heuristic to evaluate "combinability" and then uses the larger sets of genes to re-calculate gene trees, has good empirical performance, but using statistical binning within a phylogenomic pipeline does not have the desirable property of being statistically consistent. We show that weighting the re-calculated gene trees by the bin sizes makes statistical binning statistically consistent under the multispecies coalescent, and maintains the good empirical performance. 
Thus, "weighted statistical binning" enables highly accurate genome-scale species tree estimation, and is also statistically consistent under the multi-species coalescent model. New data used in this study are available at DOI: http://dx.doi.org/10.6084/m9.figshare.1411146, and the software is available at https://github.com/smirarab/binning.
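The weighting step described in the abstract can be illustrated with a minimal Python sketch. It assumes the bins and their re-calculated ("supergene") trees have already been computed, and it represents weighting by bin size as repeating each bin's tree once per gene in that bin before the trees are passed to a summary method. The function name `weight_supergene_trees`, the bin identifiers, and the toy Newick strings are hypothetical; this is not the interface of the software at https://github.com/smirarab/binning.

```python
from typing import Dict, List


def weight_supergene_trees(
    bins: Dict[str, List[str]],
    supergene_trees: Dict[str, str],
) -> List[str]:
    """Return each bin's supergene tree repeated once per gene in the bin.

    bins: hypothetical mapping from bin id to the genes placed in that bin.
    supergene_trees: hypothetical mapping from bin id to a Newick string
        for the tree re-calculated from the genes in that bin.
    """
    weighted: List[str] = []
    for bin_id, genes in bins.items():
        tree = supergene_trees[bin_id]
        # Weight by bin size: a bin with k genes contributes k copies.
        weighted.extend([tree] * len(genes))
    return weighted


# Toy input (made up for illustration): two bins of sizes 3 and 1.
bins = {"bin1": ["geneA", "geneB", "geneC"], "bin2": ["geneD"]}
supergene_trees = {"bin1": "((A,B),(C,D));", "bin2": "((A,C),(B,D));"}

weighted = weight_supergene_trees(bins, supergene_trees)
```

Without this step, each bin would contribute a single tree regardless of how many genes it holds; with it, the input to the summary method contains as many trees as there are genes.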
Pages: 40