The Evaluation of Tools Used to Predict the Impact of Missense Variants Is Hindered by Two Types of Circularity

Cited by: 254
Authors
Grimm, Dominik G. [1, 2, 3, 4]
Azencott, Chloe-Agathe [1, 2, 5, 6, 7]
Aicheler, Fabian [1, 2, 3]
Gieraths, Udo [1, 2]
MacArthur, Daniel G. [8, 9, 10]
Samocha, Kaitlin E. [8, 9, 10]
Cooper, David N. [11]
Stenson, Peter D. [11]
Daly, Mark J. [8, 9, 10]
Smoller, Jordan W. [10, 12, 13]
Duncan, Laramie E. [8, 9, 10]
Borgwardt, Karsten M. [1, 2, 3, 4]
Affiliations
[1] Max Planck Inst Intelligent Syst, Machine Learning & Computat Biol Res Grp, Tubingen, Germany
[2] Max Planck Inst Dev Biol, Tubingen, Germany
[3] Univ Tubingen, Zentrum Bioinformat, Tubingen, Germany
[4] Swiss Fed Inst Technol, Dept Biosyst Sci & Engn, Basel, Switzerland
[5] PLS Res Univ, MINES ParisTech, CBIO Ctr Computat Biol, Fontainebleau, France
[6] Inst Curie, Paris, France
[7] INSERM, Paris, France
[8] Massachusetts Gen Hosp, Analyt & Translat Genet Unit, Boston, MA 02114 USA
[9] Harvard Univ, Sch Med, Dept Med, Boston, MA USA
[10] Broad Inst MIT & Harvard, Cambridge, MA USA
[11] Cardiff Univ, Sch Med, Inst Med Genet, Cardiff CF10 3AX, S Glam, Wales
[12] Massachusetts Gen Hosp, Psychiat & Neurodev Genet Unit, Boston, MA 02114 USA
[13] Harvard Univ, Sch Med, Dept Psychiat, Boston, MA 02115 USA
Keywords
pathogenicity prediction tools; exome sequencing; FUNCTIONAL IMPACT; MUTATIONS; DATABASE; IDENTIFICATION; CONSEQUENCES; LIBRARY; SNVS
DOI
10.1002/humu.22768
Chinese Library Classification number
Q3 [Genetics]
Subject classification code
071007; 090102
Abstract
Prioritizing missense variants for further experimental investigation is a key challenge in current sequencing studies of complex and Mendelian diseases. Many in silico tools have been employed for pathogenicity prediction, including PolyPhen-2, SIFT, FatHMM, MutationTaster-2, MutationAssessor, Combined Annotation Dependent Depletion (CADD), LRT, phyloP, and GERP++, as well as optimized methods of combining tool scores, such as Condel and Logit. Given the wealth of these methods, an important practical question is which of these tools generalize best, that is, correctly predict the pathogenic character of new variants. Here we demonstrate, in a study of 10 tools on five datasets, that such a comparative evaluation is hindered by two types of circularity: they arise when (1) the same variants or (2) different variants from the same protein occur both in the datasets used for training and in those used for evaluation of these tools, which may lead to overly optimistic results. We show that comparative evaluations of predictors that do not address these types of circularity may erroneously conclude that circularity-confounded tools are the most accurate among all tools, and may even outperform optimized combinations of tools.
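As an illustration of the two types of circularity described in the abstract, the following minimal Python sketch partitions an evaluation set into variants that also appear in a training set (type 1), variants that are new but come from a protein represented in the training set (type 2), and variants from proteins unseen in training. The `Variant` record, the `circularity_report` function, and the protein and substitution labels are hypothetical names chosen for this example; this is not the authors' procedure, only one way such an overlap check might look.

```python
from typing import Dict, Iterable, List, NamedTuple


class Variant(NamedTuple):
    protein: str  # hypothetical protein identifier
    change: str   # hypothetical amino-acid substitution label


def circularity_report(
    train: Iterable[Variant], evaluation: Iterable[Variant]
) -> Dict[str, List[Variant]]:
    """Split evaluation variants by their overlap with the training data."""
    train_set = set(train)
    train_proteins = {v.protein for v in train_set}

    type1: List[Variant] = []  # same variant present in training data
    type2: List[Variant] = []  # different variant, same protein as training data
    clean: List[Variant] = []  # protein never seen in training data

    for v in evaluation:
        if v in train_set:
            type1.append(v)
        elif v.protein in train_proteins:
            type2.append(v)
        else:
            clean.append(v)

    return {"type1": type1, "type2": type2, "clean": clean}


# Toy data, invented purely for illustration.
train = [Variant("P1", "A10V"), Variant("P1", "G20S"), Variant("P2", "L5P")]
evaluation = [Variant("P1", "A10V"), Variant("P1", "K33E"), Variant("P3", "D7N")]

report = circularity_report(train, evaluation)
for kind, variants in report.items():
    print(kind, variants)
```

Under these assumptions, only the variants in the `clean` group would be free of both kinds of overlap with the training data; how to handle the other two groups in an actual benchmark is a design decision the sketch does not make.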
Pages: 513-523
Number of pages: 11