A new disease-specific machine learning approach for the prediction of cancer-causing missense variants

被引:66
作者
Capriotti, Emidio [1 ,3 ]
Altman, Russ B. [2 ]
机构
[1] Stanford Univ, Dept Bioengn, Stanford, CA 94305 USA
[2] Stanford Univ, Dept Genet, Stanford, CA 94305 USA
[3] Univ Balearic Isl, Dept Math & Comp Sci, Palma De Mallorca, Spain
关键词
Single Nucleotide Polymorphisms; Cancer-causing variants; Gene Ontology; Machine-learning; Support Vector Machine; SINGLE-NUCLEOTIDE POLYMORPHISMS; NON-SYNONYMOUS SNPS; PROTEIN MUTATIONS; SOMATIC MUTATIONS; GENE ONTOLOGY; HUMAN BREAST; ANNOTATION; SEQUENCE; DATABASE; TOOL;
D O I
10.1016/j.ygeno.2011.06.010
中图分类号
Q81 [生物工程学(生物技术)]; Q93 [微生物学];
学科分类号
071005 ; 0836 ; 090102 ; 100705 ;
摘要
High-throughput genotyping and sequencing techniques are rapidly and inexpensively providing large amounts of human genetic variation data. Single Nucleotide Polymorphisms (SNPs) are an important source of human genome variability and have been implicated in several human diseases, including cancer. Amino acid mutations resulting from non-synonymous SNPs in coding regions may generate protein functional changes that affect cell proliferation. In this study, we developed a machine learning approach to predict cancer-causing missense variants. We present a Support Vector Machine (SVM) classifier trained on a set of 3163 cancer-causing variants and an equal number of neutral polymorphisms. The method achieve 93% overall accuracy, a correlation coefficient of 0.86, and area under ROC curve of 0.98. When compared with other previously developed algorithms such as SIFT and CHASM our method results in higher prediction accuracy and correlation coefficient in identifying cancer-causing variants. (C) 2011 Elsevier Inc. All rights reserved.
引用
收藏
页码:310 / 317
页数:8
相关论文
共 53 条
[1]   Gapped BLAST and PSI-BLAST: a new generation of protein database search programs [J].
Altschul, SF ;
Madden, TL ;
Schaffer, AA ;
Zhang, JH ;
Zhang, Z ;
Miller, W ;
Lipman, DJ .
NUCLEIC ACIDS RESEARCH, 1997, 25 (17) :3389-3402
[2]   A map of human genome variation from population-scale sequencing [J].
Altshuler, David ;
Durbin, Richard M. ;
Abecasis, Goncalo R. ;
Bentley, David R. ;
Chakravarti, Aravinda ;
Clark, Andrew G. ;
Collins, Francis S. ;
De la Vega, Francisco M. ;
Donnelly, Peter ;
Egholm, Michael ;
Flicek, Paul ;
Gabriel, Stacey B. ;
Gibbs, Richard A. ;
Knoppers, Bartha M. ;
Lander, Eric S. ;
Lehrach, Hans ;
Mardis, Elaine R. ;
McVean, Gil A. ;
Nickerson, DebbieA. ;
Peltonen, Leena ;
Schafer, Alan J. ;
Sherry, Stephen T. ;
Wang, Jun ;
Wilson, Richard K. ;
Gibbs, Richard A. ;
Deiros, David ;
Metzker, Mike ;
Muzny, Donna ;
Reid, Jeff ;
Wheeler, David ;
Wang, Jun ;
Li, Jingxiang ;
Jian, Min ;
Li, Guoqing ;
Li, Ruiqiang ;
Liang, Huiqing ;
Tian, Geng ;
Wang, Bo ;
Wang, Jian ;
Wang, Wei ;
Yang, Huanming ;
Zhang, Xiuqing ;
Zheng, Huisong ;
Lander, Eric S. ;
Altshuler, David L. ;
Ambrogio, Lauren ;
Bloom, Toby ;
Cibulskis, Kristian ;
Fennell, Tim J. ;
Gabriel, Stacey B. .
NATURE, 2010, 467 (7319) :1061-1073
[3]   Gene Ontology: tool for the unification of biology [J].
Ashburner, M ;
Ball, CA ;
Blake, JA ;
Botstein, D ;
Butler, H ;
Cherry, JM ;
Davis, AP ;
Dolinski, K ;
Dwight, SS ;
Eppig, JT ;
Harris, MA ;
Hill, DP ;
Issel-Tarver, L ;
Kasarskis, A ;
Lewis, S ;
Matese, JC ;
Richardson, JE ;
Ringwald, M ;
Rubin, GM ;
Sherlock, G .
NATURE GENETICS, 2000, 25 (01) :25-29
[4]   Assessing the accuracy of prediction algorithms for classification: an overview [J].
Baldi, P ;
Brunak, S ;
Chauvin, Y ;
Andersen, CAF ;
Nielsen, H .
BIOINFORMATICS, 2000, 16 (05) :412-424
[5]   Prediction of the phenotypic effects of non-synonymous single nucleotide polymorphisms using structural and evolutionary information [J].
Bao, L ;
Cui, Y .
BIOINFORMATICS, 2005, 21 (10) :2185-2190
[6]   The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003 [J].
Boeckmann, B ;
Bairoch, A ;
Apweiler, R ;
Blatter, MC ;
Estreicher, A ;
Gasteiger, E ;
Martin, MJ ;
Michoud, K ;
O'Donovan, C ;
Phan, I ;
Pilbout, S ;
Schneider, M .
NUCLEIC ACIDS RESEARCH, 2003, 31 (01) :365-370
[7]   Over-optimism in bioinformatics research [J].
Boulesteix, Anne-Laure .
BIOINFORMATICS, 2010, 26 (03) :437-439
[8]   GO::TermFinder - open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes [J].
Boyle, EI ;
Weng, SA ;
Gollub, J ;
Jin, H ;
Botstein, D ;
Cherry, JM ;
Sherlock, G .
BIOINFORMATICS, 2004, 20 (18) :3710-3715
[9]   SNAP predicts effect of mutations on protein function [J].
Bromberg, Yana ;
Yachdav, Guy ;
Rost, Burkhard .
BIOINFORMATICS, 2008, 24 (20) :2397-2398
[10]   Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls [J].
Burton, Paul R. ;
Clayton, David G. ;
Cardon, Lon R. ;
Craddock, Nick ;
Deloukas, Panos ;
Duncanson, Audrey ;
Kwiatkowski, Dominic P. ;
McCarthy, Mark I. ;
Ouwehand, Willem H. ;
Samani, Nilesh J. ;
Todd, John A. ;
Donnelly, Peter ;
Barrett, Jeffrey C. ;
Davison, Dan ;
Easton, Doug ;
Evans, David ;
Leung, Hin-Tak ;
Marchini, Jonathan L. ;
Morris, Andrew P. ;
Spencer, Chris C. A. ;
Tobin, Martin D. ;
Attwood, Antony P. ;
Boorman, James P. ;
Cant, Barbara ;
Everson, Ursula ;
Hussey, Judith M. ;
Jolley, Jennifer D. ;
Knight, Alexandra S. ;
Koch, Kerstin ;
Meech, Elizabeth ;
Nutland, Sarah ;
Prowse, Christopher V. ;
Stevens, Helen E. ;
Taylor, Niall C. ;
Walters, Graham R. ;
Walker, Neil M. ;
Watkins, Nicholas A. ;
Winzer, Thilo ;
Jones, Richard W. ;
McArdle, Wendy L. ;
Ring, Susan M. ;
Strachan, David P. ;
Pembrey, Marcus ;
Breen, Gerome ;
St Clair, David ;
Caesar, Sian ;
Gordon-Smith, Katherine ;
Jones, Lisa ;
Fraser, Christine ;
Green, Elain K. .
NATURE, 2007, 447 (7145) :661-678