Improving accuracy for cancer classification with a new algorithm for genes selection

Cited by: 62
Authors
Zhang, Hongyan [1 ,2 ,6 ]
Wang, Haiyan [3 ]
Dai, Zhijun [1 ,2 ]
Chen, Ming-shun [4 ,5 ]
Yuan, Zheming [1 ,2 ]
Affiliations
[1] Hunan Prov Key Lab Crop Germplasm Innovat & Utili, Changsha 410128, Hunan, Peoples R China
[2] Hunan Agr Univ, Coll Biosafety Sci & Technol, Changsha 410128, Hunan, Peoples R China
[3] Kansas State Univ, Dept Stat, Manhattan, KS 66506 USA
[4] Kansas State Univ, USDA ARS, Manhattan, KS 66506 USA
[5] Kansas State Univ, Dept Entomol, Manhattan, KS 66506 USA
[6] Hunan Agr Univ, Coll Informat Sci & Technol, Changsha 410128, Hunan, Peoples R China
Funding
Specialized Research Fund for the Doctoral Program of Higher Education;
Keywords
RANDOM SUBSPACE METHOD; MICROARRAY DATA; LUNG-CANCER; SVM-RFE; EXPRESSION; TUMOR; PREDICTION; REDUNDANCY; RELEVANCE; DIAGNOSIS;
DOI
10.1186/1471-2105-13-298
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Background: Although the classification of cancer tissue samples based on gene expression data has advanced considerably in recent years, improving accuracy remains a major challenge. One difficulty is establishing an effective method that can select a parsimonious set of relevant genes. So far, most gene selection methods in the literature focus on screening individual genes or pairs of genes without considering possible interactions among genes. Here we introduce a new computational method named the Binary Matrix Shuffling Filter (BMSF). It not only overcomes the difficulty associated with the search schemes of traditional wrapper methods and the overfitting problem in a high-dimensional search space, but also takes potential gene interactions into account during gene selection. This method, coupled with a Support Vector Machine (SVM) for implementation, often selects a very small number of genes, which makes the model easy to interpret.

Results: We applied our method to 9 two-class gene expression datasets involving human cancers. During the gene selection process, the set of genes to be kept in the model was recursively refined and repeatedly updated according to the effect of a given gene on the contributions of other genes to cancer classification. The small number of informative genes selected from each dataset leads to significantly improved leave-one-out cross-validation (LOOCV) classification accuracy across all 9 datasets for multiple classifiers. The selected genes also generalize broadly, since multiple commonly used classifiers achieved either equivalent or much higher LOOCV accuracy than those reported in the literature.

Conclusions: A gene's contribution to binary cancer classification is better evaluated after adjusting for the joint effect of a large number of other genes. A computationally efficient search scheme was provided to search effectively in the extensive feature space that includes possible interactions of many genes. Performance of the algorithm on 9 datasets suggests that it is possible to improve the accuracy of cancer classification by a large margin when the joint effects of many genes are considered.
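The abstract reports leave-one-out (LOOCV) accuracy on a small set of selected genes. The sketch below illustrates only that evaluation step, not BMSF itself. It rests on several assumptions that are not from the paper: the expression values are made up, the function names `nearest_centroid_predict` and `loocv_accuracy` are hypothetical, and a simple nearest-centroid classifier stands in for the SVM so that the example needs only the Python standard library.

```python
from statistics import mean


def nearest_centroid_predict(train_X, train_y, x):
    """Return the label whose class centroid is closest to x.

    Stand-in classifier (assumption): the paper uses an SVM, not this rule.
    """
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, lab in zip(train_X, train_y) if lab == label]
        centroid = [mean(col) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loocv_accuracy(X, y, genes):
    """Leave-one-out accuracy using only the gene columns listed in `genes`."""
    sub = [[row[g] for g in genes] for row in X]
    hits = 0
    for i in range(len(sub)):
        # Hold out sample i, train on the rest, then test on sample i.
        train_X = sub[:i] + sub[i + 1:]
        train_y = y[:i] + y[i + 1:]
        hits += nearest_centroid_predict(train_X, train_y, sub[i]) == y[i]
    return hits / len(sub)


# Toy expression matrix (made-up values): 6 samples x 3 genes.
# Gene 0 separates the two classes; genes 1 and 2 are noise.
X = [
    [1.0, 5.0, 2.0],
    [1.2, 1.0, 9.0],
    [0.9, 8.0, 4.0],
    [3.0, 4.5, 3.0],
    [3.1, 1.5, 8.0],
    [2.9, 7.5, 5.0],
]
y = [0, 0, 0, 1, 1, 1]

acc_selected = loocv_accuracy(X, y, genes=[0])    # informative gene only
acc_all = loocv_accuracy(X, y, genes=[0, 1, 2])   # all genes, including noise
print(acc_selected, acc_all)
```

Comparing `acc_selected` with `acc_all` on real data would show whether a small gene subset classifies at least as well as the full gene set, which is the kind of comparison the Results paragraph describes.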
Pages: 20