Identifying Mendelian disease genes with the Variant Effect Scoring Tool

被引:356
作者
Carter, Hannah [1 ,2 ]
Douville, Christopher [1 ,2 ]
Stenson, Peter D. [3 ]
Cooper, David N. [3 ]
Karchin, Rachel [1 ,2 ]
机构
[1] Johns Hopkins Univ, Dept Biomed Engn, Baltimore, MD 21218 USA
[2] Johns Hopkins Univ, Inst Computat Med, Baltimore, MD USA
[3] Cardiff Univ, Inst Med Genet, Sch Med, Cardiff CF14 4XN, S Glam, Wales
来源
BMC GENOMICS | 2013年 / 14卷
基金
美国国家科学基金会; 美国国家卫生研究院;
关键词
PROTEIN FUNCTION; GENOME BROWSER; MUTATIONS; DATABASE; POLYMORPHISMS; SUBSTITUTIONS; ASSOCIATION; DISCOVERY; FRAMEWORK;
D O I
10.1186/1471-2164-14-S3-S3
中图分类号
Q81 [生物工程学(生物技术)]; Q93 [微生物学];
学科分类号
071005 ; 0836 ; 090102 ; 100705 ;
摘要
Background: Whole exome sequencing studies identify hundreds to thousands of rare protein coding variants of ambiguous significance for human health. Computational tools are needed to accelerate the identification of specific variants and genes that contribute to human disease. Results: We have developed the Variant Effect Scoring Tool (VEST), a supervised machine learning-based classifier, to prioritize rare missense variants with likely involvement in human disease. The VEST classifier training set comprised similar to 45,000 disease mutations from the latest Human Gene Mutation Database release and another similar to 45,000 high frequency (allele frequency >1%) putatively neutral missense variants from the Exome Sequencing Project. VEST outperforms some of the most popular methods for prioritizing missense variants in carefully designed holdout benchmarking experiments (VEST ROC AUC = 0.91, PolyPhen2 ROC AUC = 0.86, SIFT4.0 ROC AUC = 0.84). VEST estimates variant score p-values against a null distribution of VEST scores for neutral variants not included in the VEST training set. These p-values can be aggregated at the gene level across multiple disease exomes to rank genes for probable disease involvement. We tested the ability of an aggregate VEST gene score to identify candidate Mendelian disease genes, based on whole-exome sequencing of a small number of disease cases. We used whole-exome data for two Mendelian disorders for which the causal gene is known. Considering only genes that contained variants in all cases, the VEST gene score ranked dihydroorotate dehydrogenase (DHODH) number 2 of 2253 genes in four cases of Miller syndrome, and myosin-3 (MYH3) number 2 of 2313 genes in three cases of Freeman Sheldon syndrome. Conclusions: Our results demonstrate the potential power gain of aggregating bioinformatics variant scores into gene-level scores and the general utility of bioinformatics in assisting the search for disease genes in large-scale exome sequencing studies. VEST is available as a stand-alone software package at http://wiki.chasmsoftware.org and is hosted by the CRAVAT web server at http://www.cravat.us
引用
收藏
页数:16
相关论文
共 50 条
  • [1] A method and server for predicting damaging missense mutations
    Adzhubei, Ivan A.
    Schmidt, Steffen
    Peshkin, Leonid
    Ramensky, Vasily E.
    Gerasimova, Anna
    Bork, Peer
    Kondrashov, Alexey S.
    Sunyaev, Shamil R.
    [J]. NATURE METHODS, 2010, 7 (04) : 248 - 249
  • [2] Dindel: Accurate indel calls from short-read data
    Albers, Cornelis A.
    Lunter, Gerton
    MacArthur, Daniel G.
    McVean, Gilean
    Ouwehand, Willem H.
    Durbin, Richard
    [J]. GENOME RESEARCH, 2011, 21 (06) : 961 - 973
  • [3] A map of human genome variation from population-scale sequencing
    Altshuler, David
    Durbin, Richard M.
    Abecasis, Goncalo R.
    Bentley, David R.
    Chakravarti, Aravinda
    Clark, Andrew G.
    Collins, Francis S.
    De la Vega, Francisco M.
    Donnelly, Peter
    Egholm, Michael
    Flicek, Paul
    Gabriel, Stacey B.
    Gibbs, Richard A.
    Knoppers, Bartha M.
    Lander, Eric S.
    Lehrach, Hans
    Mardis, Elaine R.
    McVean, Gil A.
    Nickerson, DebbieA.
    Peltonen, Leena
    Schafer, Alan J.
    Sherry, Stephen T.
    Wang, Jun
    Wilson, Richard K.
    Gibbs, Richard A.
    Deiros, David
    Metzker, Mike
    Muzny, Donna
    Reid, Jeff
    Wheeler, David
    Wang, Jun
    Li, Jingxiang
    Jian, Min
    Li, Guoqing
    Li, Ruiqiang
    Liang, Huiqing
    Tian, Geng
    Wang, Bo
    Wang, Jian
    Wang, Wei
    Yang, Huanming
    Zhang, Xiuqing
    Zheng, Huisong
    Lander, Eric S.
    Altshuler, David L.
    Ambrogio, Lauren
    Bloom, Toby
    Cibulskis, Kristian
    Fennell, Tim J.
    Gabriel, Stacey B.
    [J]. NATURE, 2010, 467 (7319) : 1061 - 1073
  • [4] Shape quantization and recognition with randomized trees
    Amit, Y
    Geman, D
    [J]. NEURAL COMPUTATION, 1997, 9 (07) : 1545 - 1588
  • [5] [Anonymous], R LANG ENV STAT COMP
  • [6] [Anonymous], 2004, Mach. Learn.
  • [7] BamTools: a C++ API and toolkit for analyzing and managing BAM files
    Barnett, Derek W.
    Garrison, Erik K.
    Quinlan, Aaron R.
    Stroemberg, Michael P.
    Marth, Gabor T.
    [J]. BIOINFORMATICS, 2011, 27 (12) : 1691 - 1692
  • [8] USING MUTUAL INFORMATION FOR SELECTING FEATURES IN SUPERVISED NEURAL-NET LEARNING
    BATTITI, R
    [J]. IEEE TRANSACTIONS ON NEURAL NETWORKS, 1994, 5 (04): : 537 - 550
  • [9] CONTROLLING THE FALSE DISCOVERY RATE - A PRACTICAL AND POWERFUL APPROACH TO MULTIPLE TESTING
    BENJAMINI, Y
    HOCHBERG, Y
    [J]. JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY, 1995, 57 (01) : 289 - 300
  • [10] Random forests
    Breiman, L
    [J]. MACHINE LEARNING, 2001, 45 (01) : 5 - 32