G-Forest: An ensemble method for cost-sensitive feature selection in gene expression microarrays

Cited by: 28
Authors
Abdulla, Mai [1 ,2 ]
Khasawneh, Mohammad T. [1 ]
Affiliations
[1] SUNY Binghamton, Dept Syst Sci & Ind Engn, Binghamton, NY 13902 USA
[2] 9301 Avondale RD NE,Apt K1060, Redmond, WA 98052 USA
Keywords
Feature selection; Cost-sensitive; Genetic algorithm; Random Forest; Microarray gene expression; Silent disease diagnosis; CANCER CLASSIFICATION; ALGORITHM; HYBRID; FRAMEWORK; MACHINE; DISCOVERY;
DOI
10.1016/j.artmed.2020.101941
CLC number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Microarray gene expression profiling has emerged as an efficient technique for cancer diagnosis, prognosis, and treatment. One of the major drawbacks of gene expression microarrays is the "curse of dimensionality", which hinders the usefulness of information in datasets and leads to computational instability. In recent years, feature selection techniques have emerged as effective tools for identifying disease biomarkers to aid medical screening and diagnosis. However, existing feature selection techniques, first, do not suit the rare variance that exists in genomic data and, second, do not consider the feature cost (i.e., the gene cost). Because ignoring feature costs may result in high-cost gene profiling, this study proposes a new algorithm, called G-Forest, for cost-sensitive feature selection in gene expression microarrays. G-Forest is an ensemble cost-sensitive feature selection algorithm that evolves a population of biases for a Random Forest induction algorithm. G-Forest embeds the feature cost in the feature selection process and allows for the simultaneous selection of low-cost and highly informative features. In particular, when constructing the initial population, each feature is selected with a probability inversely proportional to its associated cost. G-Forest was compared with multiple state-of-the-art algorithms. Experimental results showed its effectiveness and robustness in selecting the least costly and most informative genes. On average, G-Forest improved accuracy by up to 14% and decreased costs by up to 56% compared with the other approaches tested in this article.
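The cost-biased initialization described in the abstract, where each feature is drawn with probability inversely proportional to its cost, can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation; the function name, population size, and subset size are assumptions made for the example.

```python
import numpy as np

def init_population(costs, n_individuals, subset_size, rng=None):
    """Sketch of a G-Forest-style initial population: each individual is a
    feature subset, and each feature is sampled with probability inversely
    proportional to its cost (hypothetical helper, not the paper's code)."""
    rng = np.random.default_rng(rng)
    costs = np.asarray(costs, dtype=float)
    weights = 1.0 / costs                  # cheaper genes get larger weight
    probs = weights / weights.sum()        # normalize to a distribution
    return [rng.choice(len(costs), size=subset_size, replace=False, p=probs)
            for _ in range(n_individuals)]

# Example: five genes with assay costs; cheap genes dominate the population.
costs = [1.0, 1.0, 10.0, 10.0, 100.0]
pop = init_population(costs, n_individuals=200, subset_size=2, rng=0)
```

Biasing only the initialization keeps the later genetic search unconstrained, so expensive but highly informative genes can still survive if they improve Random Forest fitness.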
Pages: 11