Iterative feature removal yields highly discriminative pathways

Cited by: 15
Authors
O'Hara, Stephen [1]
Wang, Kun [1,7]
Slayden, Richard A. [2]
Schenkel, Alan R. [2]
Huber, Greg [3]
O'Hern, Corey S. [4,5]
Shattuck, Mark D. [6]
Kirby, Michael [1]
Affiliations
[1] Colorado State Univ, Dept Math, Ft Collins, CO 80523 USA
[2] Colorado State Univ, Dept Microbiol Immunol & Pathol, Ft Collins, CO 80523 USA
[3] Univ Calif Santa Barbara, Kavli Inst Theoret Phys, Santa Barbara, CA 93106 USA
[4] Yale Univ, Dept Appl Phys, Dept Mech Engn & Mat Sci, New Haven, CT 06520 USA
[5] Yale Univ, Dept Phys, New Haven, CT USA
[6] CUNY City Coll, Dept Phys, New York, NY 10031 USA
[7] Yale Univ, Dept Mech Engn & Mat Sci, New Haven, CT USA
Funding
US National Science Foundation;
Keywords
Feature selection; Microarray; Discrimination; Classification; Pathways; Sparse SVM; Influenza; FEATURE-SELECTION; EXPRESSION PROFILES; MICROARRAY DATA; GENE SELECTION; BREAST-CANCER; CLASSIFICATION; SIGNATURES; RELEVANCE;
DOI
10.1186/1471-2164-14-832
Chinese Library Classification number
Q81 [Bioengineering (Biotechnology)]; Q93 [Microbiology];
Discipline classification code
071005 ; 0836 ; 090102 ; 100705 ;
Abstract
Background: We introduce Iterative Feature Removal (IFR) as an unbiased approach for selecting features with diagnostic capacity from large data sets. The algorithm builds on recently developed machine learning tools that are driven by sparse feature selection goals. Applied to genomic data, the method is designed to identify genes that provide deeper insight into complex interactions while remaining directly connected to diagnostic utility. We contrast this approach with the search for a minimal best set of discriminative genes, which can provide only an incomplete picture of the biological complexity.
Results: Microarray data sets typically contain far more features (genes) than samples. For this type of data, we demonstrate that there are many equivalently predictive subsets of genes. We iteratively train a classifier using features identified via a sparse support vector machine. At each iteration, we remove all the features that were previously selected. We found that we could iterate many times before a sustained drop in accuracy occurred, with each iteration removing approximately 30 genes from consideration. The classification accuracy on test data remains essentially flat even as hundreds of top genes are removed. Our method identifies sets of genes that are highly predictive, even when composed of genes that individually are not. Through automated and manual analysis of the selected genes, we demonstrate that the selected features expose relevant pathways that other approaches would have missed.
Conclusions: Our results challenge the paradigm of using feature selection techniques to design parsimonious classifiers from microarray and similar high-dimensional, small-sample-size data sets. The fact that there are many subsets of genes that work equally well to classify the data provides a strong counter-result to the notion that there is a small number of "top genes" that should be used to build classifiers. In our results, the best classifiers were formed using genes with limited univariate power, illustrating that deeper mining of features using multivariate techniques is important.
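The loop described in the Results (select features with a sparse classifier, record test accuracy, remove the selected features, repeat) can be illustrated with a small sketch. This is not the authors' implementation: as a stand-in for their sparse support vector machine, it uses an L1-penalised hinge-loss linear model trained by proximal subgradient descent. The function names, the penalty heuristic (`frac`), the step size, and the toy data are all assumptions made for illustration.

```python
def soft_threshold(v, t):
    """Proximal operator of the L1 penalty: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def train_sparse_linear(X, y, active, frac=0.5, lr=0.1, epochs=200):
    """Fit an L1-penalised hinge-loss linear model on the `active` columns.

    Assumption: the penalty is a fraction `frac` of the largest absolute
    label-weighted feature mean among the active columns. This heuristic
    is illustrative and not taken from the paper.
    """
    n = len(X)
    lam = frac * max(
        abs(sum(yi * xi[j] for xi, yi in zip(X, y)) / n) for j in active
    )
    w = {j: 0.0 for j in active}
    b = 0.0
    for _ in range(epochs):
        grad = {j: 0.0 for j in active}
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(w[j] * xi[j] for j in active) + b)
            if margin < 1.0:  # only margin violators contribute
                for j in active:
                    grad[j] -= yi * xi[j] / n
                grad_b -= yi / n
        # Subgradient step on the hinge loss, then L1 shrinkage.
        for j in active:
            w[j] = soft_threshold(w[j] - lr * grad[j], lr * lam)
        b -= lr * grad_b
    return w, b


def accuracy(X, y, w, b):
    """Fraction of samples whose predicted sign matches the label."""
    correct = 0
    for xi, yi in zip(X, y):
        score = sum(wj * xi[j] for j, wj in w.items()) + b
        correct += (1 if score >= 0 else -1) == yi
    return correct / len(X)


def iterative_feature_removal(X_train, y_train, X_test, y_test, max_iters=10):
    """Repeatedly select features with nonzero weight, score them, remove them."""
    active = list(range(len(X_train[0])))
    history = []  # one (selected feature indices, test accuracy) pair per iteration
    for _ in range(max_iters):
        if not active:
            break
        w, b = train_sparse_linear(X_train, y_train, active)
        selected = [j for j in active if w[j] != 0.0]
        if not selected:  # the sparse model found nothing usable
            break
        history.append((selected, accuracy(X_test, y_test, w, b)))
        active = [j for j in active if j not in selected]
    return history


# Toy data (hypothetical): three redundant informative columns at different
# scales plus one column uncorrelated with the labels. The same samples are
# reused as the test set purely to keep the example short.
y = [1, 1, -1, -1]
scales = [4.0, 1.5, 1.0]
noise = [1.0, -1.0, 1.0, -1.0]
X = [[s * yi for s in scales] + [noise[i]] for i, yi in enumerate(y)]

history = iterative_feature_removal(X, y, X, y)
for selected, acc in history:
    print(selected, acc)
```

On this toy input the strongest column is consumed first, yet later iterations still classify perfectly using the weaker redundant columns, which mirrors in miniature the abstract's claim that accuracy stays flat as top features are removed. On real microarray data one would use held-out test samples and a dedicated sparse SVM solver rather than this stand-in.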
Pages: 15