Feature selection and classification for microarray data analysis: Evolutionary methods for identifying predictive genes

Cited by: 176

Authors:

Jirapech-Umpai, T [1]

Aitken, S [1]

Affiliations:

[1] Univ Edinburgh, Sch Informat, Edinburgh EH8 9LE, Midlothian, Scotland

Source:

BMC BIOINFORMATICS | 2005 / Volume 6 / Issue 1

Keywords:

DOI:

10.1186/1471-2105-6-148

Chinese Library Classification number:

Q5 [Biochemistry]

Discipline classification codes:

071010; 081704

Abstract:

Background: In the clinical context, samples assayed by microarray are often classified by cell line or tumour type, and it is of interest to discover a set of genes that can be used as class predictors. The leukemia dataset of Golub et al. [1] and the NCI60 dataset of Ross et al. [2] present multiclass classification problems where three tumour types and nine cell lines, respectively, must be identified. We apply an evolutionary algorithm to identify the near-optimal set of predictive genes that classify the data. We also examine the initial gene selection step, whereby the most informative genes are selected from the genes assayed.

Results: In the absence of feature selection, classification accuracy on the training data is typically good, but it is not replicated on the testing data. Gene selection using the RankGene software [3] is shown to significantly improve performance on the testing data. Further, we show that the choice of feature selection criteria can have a significant effect on accuracy. The evolutionary algorithm is shown to perform stably across the space of possible parameter settings, indicating the robustness of the approach. We assess performance using a low-variance estimation technique, and present an analysis of the genes most often selected as predictors.

Conclusion: The computational methods we have developed perform robustly and accurately, and yield results in accord with clinical knowledge: a Z-score analysis of the genes most frequently selected identifies genes known to discriminate AML and Pre-T ALL leukemia. This study also confirms that significantly different sets of genes are found to be most discriminatory as the sample classes are refined.

Pages: 11

References (15 in total):

[1] [Anonymous], 2001, An introduction to genetic algorithms

[2] Ben-Dor A., 2000, RECOMB 2000. Proceedings of the Fourth Annual International Conference on Computational Molecular Biology, P54, DOI 10.1145/332306.332328

[3] Béné MC, 1999, HAEMATOLOGICA, V84, P1024

[4] Braga-Neto, UM; Dougherty, ER. Is cross-validation valid for small-sample microarray classification? BIOINFORMATICS, 2004, 20 (03): 374-380

[5] Deutsch, JM. Evolutionary algorithms for finding optimal gene sets in microarray prediction. BIOINFORMATICS, 2003, 19 (01): 45-52

[6] DUDOIT S, 2000, 576 MATH SCI RES I

[7] Golub, TR; Slonim, DK; Tamayo, P; Huard, C; Gaasenbeek, M; Mesirov, JP; Coller, H; Loh, ML; Downing, JR; Caligiuri, MA; Bloomfield, CD; Lander, ES. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. SCIENCE, 1999, 286 (5439): 531-537

[8] Li, LP; Weinberg, CR; Darden, TA; Pedersen, LG. Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method. BIOINFORMATICS, 2001, 17 (12): 1131-1142

[9] Li, T; Zhang, CL; Ogihara, M. A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression. BIOINFORMATICS, 2004, 20 (15): 2429-2437

[10] LI W, 2003, P 7 INT C RES COMP M, P217
