Feature selection and classification for microarray data analysis: Evolutionary methods for identifying predictive genes

Cited by: 176
Authors
Jirapech-Umpai, T [1]
Aitken, S [1]
Affiliations
[1] Univ Edinburgh, Sch Informat, Edinburgh EH8 9LE, Midlothian, Scotland
DOI
10.1186/1471-2105-6-148
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Background: In the clinical context, samples assayed by microarray are often classified by cell line or tumour type, and it is of interest to discover a set of genes that can be used as class predictors. The leukemia dataset of Golub et al. [1] and the NCI60 dataset of Ross et al. [2] present multiclass classification problems where three tumour types and nine cell lines, respectively, must be identified. We apply an evolutionary algorithm to identify the near-optimal set of predictive genes that classify the data. We also examine the initial gene selection step whereby the most informative genes are selected from the genes assayed.

Results: In the absence of feature selection, classification accuracy on the training data is typically good, but not replicated on the testing data. Gene selection using the RankGene software [3] is shown to significantly improve performance on the testing data. Further, we show that the choice of feature selection criteria can have a significant effect on accuracy. The evolutionary algorithm is shown to perform stably across the space of possible parameter settings, indicating the robustness of the approach. We assess performance using a low variance estimation technique, and present an analysis of the genes most often selected as predictors.

Conclusion: The computational methods we have developed perform robustly and accurately, and yield results in accord with clinical knowledge: a Z-score analysis of the genes most frequently selected identifies genes known to discriminate AML and Pre-T ALL leukemia. This study also confirms that significantly different sets of genes are found to be most discriminatory as the sample classes are refined.
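
The abstract mentions a Z-score analysis of the genes most frequently selected but does not give the formula. As a minimal sketch only, one common way to score selection frequency is to compare each gene's selection count against a binomial null in which every gene is equally likely to be picked. The binomial null, the function name `selection_z_scores`, its parameters, and the gene labels are all assumptions made for illustration; they are not taken from the paper.

```python
import math


def selection_z_scores(counts, n_runs, subset_size, n_genes):
    """Z-score of each gene's selection count under a binomial null.

    Assumption (not stated in the abstract): under random selection, a gene
    is chosen in any one run with probability p = subset_size / n_genes.
    """
    p = subset_size / n_genes
    expected = n_runs * p                      # mean of Binomial(n_runs, p)
    sd = math.sqrt(n_runs * p * (1.0 - p))     # standard deviation of the same
    return {gene: (count - expected) / sd for gene, count in counts.items()}


# Hypothetical selection counts over 100 runs, 10 genes kept per run,
# drawn from a pool of 1000 candidate genes.
counts = {"gene_a": 40, "gene_b": 1}
z = selection_z_scores(counts, n_runs=100, subset_size=10, n_genes=1000)
```

Under these assumptions, a gene selected about as often as chance predicts scores near zero, while a gene selected far more often than chance receives a large positive score.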
Pages: 11
Related papers
15 in total
[1] [Anonymous], 2001, An introduction to genetic algorithms
[2] Ben-Dor A., 2000, RECOMB 2000. Proceedings of the Fourth Annual International Conference on Computational Molecular Biology, P54, DOI 10.1145/332306.332328
[3] Béné MC, 1999, HAEMATOLOGICA, V84, P1024
[4] Braga-Neto, UM; Dougherty, ER. Is cross-validation valid for small-sample microarray classification? [J]. BIOINFORMATICS, 2004, 20 (03): 374-380
[5] Deutsch, JM. Evolutionary algorithms for finding optimal gene sets in microarray prediction [J]. BIOINFORMATICS, 2003, 19 (01): 45-52
[6] DUDOIT S, 2000, 576 MATH SCI RES I
[7] Golub, TR; Slonim, DK; Tamayo, P; Huard, C; Gaasenbeek, M; Mesirov, JP; Coller, H; Loh, ML; Downing, JR; Caligiuri, MA; Bloomfield, CD; Lander, ES. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring [J]. SCIENCE, 1999, 286 (5439): 531-537
[8] Li, LP; Weinberg, CR; Darden, TA; Pedersen, LG. Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method [J]. BIOINFORMATICS, 2001, 17 (12): 1131-1142
[9] Li, T; Zhang, CL; Ogihara, M. A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression [J]. BIOINFORMATICS, 2004, 20 (15): 2429-2437
[10] LI W, 2003, P 7 INT C RES COMP M, P217