Using Correlation-Based Feature Selection for a Diverse Collection of Bioinformatics Datasets

Cited by: 7
Authors
Wald, Randall [1]
Khoshgoftaar, Taghi M. [1]
Napolitano, Amri [1]
Affiliation
[1] Florida Atlantic Univ, Boca Raton, FL 33431 USA
Source
2014 IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOENGINEERING (BIBE) | 2014
Keywords
Correlation-Based Feature Selection; Bioinformatics; Balance; Difficulty of Learning; GENE; CLASSIFICATION; SIGNATURE;
DOI
10.1109/BIBE.2014.63
Chinese Library Classification number
R318 [Biomedical Engineering];
Discipline classification code
0831 ;
Abstract
The large number of genes found in most gene microarray datasets demands the use of feature selection techniques to alleviate the resulting problem of high dimensionality. However, the computational cost of filter-based subset evaluation techniques such as Correlation-Based Feature Selection (CFS) has generally limited their use to smaller datasets, or at least to smaller collections of gene microarray datasets. No previous work has applied CFS to a large and diverse range of bioinformatics datasets. To address this gap, we employ nine different microarray datasets exhibiting a wide range of characteristics in terms of dataset balance (fraction of instances found in the minority class) and dataset difficulty of learning (overall difficulty of building effective classification models on raw, pre-feature-selection datasets). We also use five classification learners to discover how these perform in conjunction with CFS, along with five performance metrics to give a broad perspective on our results. The results show that CFS can be used to help build effective models, in particular when used with the 5-Nearest Neighbors learner on data that is Easy or Moderate (in terms of difficulty of learning) or Balanced (in terms of class distribution). For other types of data, the optimal learner varies, although in most cases the Logistic Regression learner performs worst in conjunction with CFS.
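As a rough illustration of two ideas named in the abstract, the sketch below computes dataset balance (the minority-class fraction) and a CFS-style subset merit, then grows a subset greedily. It is not the authors' implementation: the abstract does not say which correlation measure or search strategy was used, so the absolute Pearson correlation and the greedy forward search here are assumptions, and the function names (`minority_fraction`, `cfs_merit`, `greedy_cfs`) and the `max_features` parameter are hypothetical. The merit formula is the commonly cited CFS heuristic, k * r_cf / sqrt(k + k(k-1) * r_ff), where r_cf is the mean feature-class correlation and r_ff the mean feature-feature correlation.

```python
import numpy as np

def minority_fraction(labels):
    """Dataset balance: fraction of instances in the least common class."""
    _, counts = np.unique(labels, return_counts=True)
    return counts.min() / counts.sum()

def cfs_merit(X, y, subset):
    """CFS-style merit of a feature subset.

    Assumption: correlations are absolute Pearson coefficients; other
    measures (e.g. symmetrical uncertainty) are also used with CFS.
    """
    k = len(subset)
    if k == 0:
        return 0.0
    # Mean absolute correlation between each selected feature and the class.
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return r_cf
    # Mean absolute correlation between distinct selected features
    # (off-diagonal entries of the k x k correlation matrix).
    corr = np.abs(np.corrcoef(X[:, subset], rowvar=False))
    r_ff = (corr.sum() - k) / (k * (k - 1))
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

def greedy_cfs(X, y, max_features=10):
    """Greedy forward search: add the feature that most improves the merit,
    stopping when no candidate improves it (an assumed search strategy)."""
    selected, best = [], 0.0
    remaining = set(range(X.shape[1]))
    while remaining and len(selected) < max_features:
        score, j = max((cfs_merit(X, y, selected + [j]), j) for j in remaining)
        if score <= best:
            break
        selected.append(j)
        remaining.remove(j)
        best = score
    return selected, best
```

Each greedy step rescans all remaining features and recomputes subset correlations, which hints at why subset evaluation becomes expensive on microarray data with thousands of genes.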
Pages: 156-162
Page count: 7