High sensitivity to aligner and high rate of false positives in the estimates of positive selection in the 12 Drosophila genomes

Cited by: 104
Authors
Markova-Raina, Penka [1]
Petrov, Dmitri [1]
Affiliations
[1] Stanford Univ, Dept Biol, Stanford, CA 94305 USA
Keywords
MULTIPLE SEQUENCE ALIGNMENT; AMINO-ACID SITES; PHYLOGENETIC ANALYSIS; BRANCH-SITE; EVOLUTION; ACCURACY; GENES; POWER; INSERTIONS/DELETIONS; SUBSTITUTIONS;
DOI
10.1101/gr.115949.110
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010; 081704
Abstract
We investigate the effect of aligner choice on inferences of positive selection using site-specific models of molecular evolution. We find that, independently of the choice of aligner, the rate of false positives is unacceptably high. Our study is a whole-genome analysis of all protein-coding genes in 12 Drosophila genomes annotated in either all 12 species (~6690 genes) or in the six melanogaster group species. We compare six popular aligners (PRANK, T-Coffee, ClustalW, ProbCons, AMAP, and MUSCLE) and find that the aligner choice strongly influences the estimates of positive selection. Differences persist when we use (1) different stringency cutoffs, (2) different selection inference models, (3) alignments with or without gaps and/or additional masking, (4) per-site versus per-gene statistics, and (5) closely related melanogaster group species versus the more distant 12 Drosophila genomes. Furthermore, we find that these differences are consequential for downstream analyses such as determination of over- and under-represented GO terms associated with positive selection. Visual analysis indicates that most sites inferred as positively selected are, in fact, misaligned at the codon level, resulting in false-positive rates of 48%-82%. PRANK, which has been reported to outperform other aligners in simulations, performed best in our empirical study as well. Unfortunately, PRANK still had a false-positive rate of 50%-55%, which is high and unacceptable for most applications. We identify misannotations and indels, many of which appear to be located in disordered protein regions, as primary culprits for the high misalignment-related error levels, and we discuss possible workaround approaches to this apparently pervasive problem in genome-wide evolutionary analyses.
Pages: 863-874
Number of pages: 12