A Hybrid Feature Selection Method for Complex Diseases SNPs

Cited by: 38
Authors
Alzubi, Raid [1]
Ramzan, Naeem [1]
Alzoubi, Hadeel [1]
Amira, Abbes [2]
Affiliations
[1] Univ West Scotland, Sch Engn & Comp, Paisley PA1 2BE, Renfrew, Scotland
[2] Qatar Univ, Coll Engn, Dept Comp Sci & Engn, Doha, Qatar
Keywords
Single nucleotide polymorphism (SNP); feature selection; hybrid algorithms; complex diseases; machine learning; GENETIC ALGORITHM; IDENTIFICATION; CLASSIFICATION; INFORMATION;
DOI
10.1109/ACCESS.2017.2778268
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Machine learning techniques have the potential to revolutionize medical diagnosis. Single nucleotide polymorphisms (SNPs) are one of the most important sources of human genome variability and have been implicated in several human diseases. Various techniques have been applied to SNPs to separate affected samples from normal ones. Achieving high classification accuracy in such a high-dimensional space is crucial for successful diagnosis and treatment. In this work, we propose an accurate hybrid feature selection method for detecting the most informative SNPs and selecting an optimal SNP subset. The proposed method fuses a filter method and a wrapper method: Conditional Mutual Information Maximization (CMIM) and support vector machine recursive feature elimination (SVM-RFE), respectively. The performance of the proposed method was evaluated against four state-of-the-art feature selection methods (minimum redundancy maximum relevancy, fast correlation-based feature selection, CMIM, and ReliefF) using four classifiers (support vector machine, naive Bayes, linear discriminant analysis, and k-nearest neighbors) on five different SNP data sets obtained from the National Center for Biotechnology Information Gene Expression Omnibus genomics data repository. The experimental results demonstrate the efficiency of the proposed feature selection approach, which outperformed all of the compared feature selection algorithms and achieved up to 96% classification accuracy on the data set used. In general, we conclude from these results that whole-genome SNPs can be efficiently employed to distinguish individuals affected by complex diseases from healthy ones.
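The filter stage named in the abstract, CMIM, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes genotypes are coded as small integers, uses plug-in (count-based) entropy estimates, and runs on a hypothetical toy data set whose names (`snp1`, `snp1_dup`, `snp2`, `noise`) are invented for illustration. CMIM greedily picks the feature whose information about the class label is largest after conditioning on each feature already selected, so a redundant copy of a chosen SNP scores zero.

```python
from collections import Counter
from math import log2


def mutual_info(x, y):
    """Plug-in estimate of I(X;Y) in bits for two discrete sequences."""
    n = len(x)
    cx, cy, cxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * log2(c * n / (cx[a] * cy[b]))
               for (a, b), c in cxy.items())


def cond_mutual_info(x, y, z):
    """Plug-in estimate of I(X;Y|Z) = sum over z of p(z) * I(X;Y | Z=z)."""
    groups = {}
    for xi, yi, zi in zip(x, y, z):
        gx, gy = groups.setdefault(zi, ([], []))
        gx.append(xi)
        gy.append(yi)
    n = len(z)
    return sum(len(gx) / n * mutual_info(gx, gy)
               for gx, gy in groups.values())


def cmim(features, labels, k):
    """Return up to k feature names in CMIM selection order.

    features: dict mapping a feature name to its list of discrete values.
    """
    # score[f] = min over selected s of I(f; Y | s), initialised to I(f; Y)
    score = {f: mutual_info(col, labels) for f, col in features.items()}
    selected = []
    while score and len(selected) < k:
        best = max(score, key=score.get)
        selected.append(best)
        del score[best]
        for f in score:
            cmi = cond_mutual_info(features[f], labels, features[best])
            score[f] = min(score[f], cmi)
    return selected


# Hypothetical toy data: 12 samples, 6 controls (0) and 6 cases (1).
labels = [0] * 6 + [1] * 6
features = {
    "snp1":     [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0],  # strongly informative
    "snp1_dup": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0],  # exact copy of snp1
    "snp2":     [0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1],  # weaker, not redundant
    "noise":    [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],  # unrelated to labels
}

selected = cmim(features, labels, k=2)
print(selected)
```

On this toy data, ranking by marginal mutual information alone would place `snp1_dup` second, whereas CMIM skips it in favor of `snp2`. The abstract does not specify how the two stages are combined; one plausible arrangement, stated here as an assumption, is to pass the CMIM shortlist to a wrapper such as scikit-learn's `RFE` around a linear-kernel `SVC`.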
Pages: 1292-1301
Page count: 10