A Nested Genetic Algorithm for feature selection in high-dimensional cancer Microarray datasets

Cited by: 156
Authors
Sayed, Sabah [1]
Nassef, Mohammad [1]
Badr, Amr [1]
Farag, Ibrahim [1]
Affiliations
[1] Cairo Univ, Dept Comp Sci, Fac Comp & Informat, 5 Dr Ahmed Zewail St, Giza, Egypt
Keywords
Microarray gene expression; DNA Methylation; Colon cancer; Lung cancer; Machine learning; Genetic algorithm; Feature selection; Support Vector Machine; EXPRESSION DATA; CLASSIFICATION; IDENTIFICATION; METHYLATION; MULTICLASS; MARKERS; TOOL;
DOI
10.1016/j.eswa.2018.12.022
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Cancer is a dangerous disease that causes death worldwide. Identifying a small number of genes relevant to a given cancer can lead to effective treatments. The main challenge with Microarray datasets is their high dimensionality: the number of features is huge compared to the modest number of samples. Recent research has attempted to reduce this dimensionality using different feature selection techniques. This paper presents an ensemble feature selection technique based on the t-test and a genetic algorithm. After preprocessing the data with a t-test, a Nested Genetic Algorithm, namely Nested-GA, is used to find the optimal subset of features by combining data from two different datasets. Nested-GA consists of two nested genetic algorithms (outer and inner) that run on two different kinds of datasets. The Outer Genetic Algorithm (OGA-SVM) works on Microarray gene expression datasets, whereas the Inner Genetic Algorithm (IGA-NNW) runs on DNA Methylation datasets. Nested-GA is run on a colon cancer dataset with 5-fold cross-validation. After applying Nested-GA, the Incremental Feature Selection (IFS) strategy is used to obtain the smallest optimal gene subset. This gene subset was validated on an independent dataset, resulting in 99.9% classification accuracy. The biological significance of the resulting optimal genes was then validated using Enrichment Analysis. Moreover, the results of Nested-GA were compared to those of other feature selection algorithms run on either Gene Expression or DNA Methylation datasets. In these experiments, Nested-GA showed the highest classification performance with a small optimal feature subset compared to the other algorithms.
Furthermore, running Nested-GA on lung cancer datasets that contain two different cancer subtypes resulted in significantly better classification accuracy (98.4%) than a previous study (84.6%) that used lung cancer DNA Methylation data only. (C) 2018 Elsevier Ltd. All rights reserved.
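The filter and Incremental Feature Selection steps described in the abstract can be illustrated roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the nested genetic algorithm stage is omitted entirely, a nearest-centroid classifier stands in for the SVM, the data are binary-labeled lists of floats, and all function names are hypothetical. Note also that ranking features on the full dataset before cross-validation, as done here for brevity, leaks information and would inflate accuracy on real data.

```python
import statistics


def welch_t(a, b):
    """Welch's t-statistic for one feature measured in two groups."""
    va, vb = statistics.variance(a), statistics.variance(b)
    denom = (va / len(a) + vb / len(b)) ** 0.5
    if denom == 0:
        return 0.0
    return (statistics.mean(a) - statistics.mean(b)) / denom


def rank_by_t_test(X, y):
    """Return feature indices sorted by decreasing |t| between class 0 and class 1."""
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, label in zip(X, y) if label == 0]
        b = [row[j] for row, label in zip(X, y) if label == 1]
        scores.append((abs(welch_t(a, b)), j))
    return [j for _, j in sorted(scores, key=lambda s: (-s[0], s[1]))]


def centroid_predict(train_X, train_y, row, feats):
    """Nearest-centroid classifier (an assumed stand-in for the paper's SVM)."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        members = [r for r, l in zip(train_X, train_y) if l == label]
        dist = sum(
            (row[j] - statistics.mean(m[j] for m in members)) ** 2 for j in feats
        )
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def cv_accuracy(X, y, feats, k=5):
    """k-fold cross-validated accuracy using only the features in `feats`."""
    correct = 0
    for fold in range(k):
        test_idx = set(range(fold, len(X), k))  # every k-th sample forms a fold
        train_X = [X[i] for i in range(len(X)) if i not in test_idx]
        train_y = [y[i] for i in range(len(X)) if i not in test_idx]
        correct += sum(
            centroid_predict(train_X, train_y, X[i], feats) == y[i] for i in test_idx
        )
    return correct / len(X)


def incremental_feature_selection(X, y, ranked, max_features):
    """Grow the subset one ranked feature at a time; keep the smallest best prefix."""
    best_feats, best_acc = [], -1.0
    for n in range(1, max_features + 1):
        acc = cv_accuracy(X, y, ranked[:n])
        if acc > best_acc:  # strict '>' keeps the smallest subset on ties
            best_feats, best_acc = ranked[:n], acc
    return best_feats, best_acc
```

A typical call would be `ranked = rank_by_t_test(X, y)` followed by `incremental_feature_selection(X, y, ranked, max_features=50)`, where `X` is a list of samples and `y` a list of 0/1 labels. The strict comparison in the last function reflects the stated goal of the smallest optimal subset: a larger prefix replaces a smaller one only if it improves accuracy.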
Pages: 233-243
Page count: 11