Stability Analysis of Feature Ranking Techniques on Biological Datasets

Cited by: 15
Authors
Dittman, David [1]
Khoshgoftaar, Taghi M. [1]
Wald, Randall [1]
Wang, Huanjing [2]
Affiliations
[1] Florida Atlantic Univ, Boca Raton, FL 33431 USA
[2] Western Kentucky Univ, Bowling Green, KY 42101 USA
Source
2011 IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE (BIBM 2011) | 2011
Keywords
Stability; DNA Microarray; Feature Selection
DOI
10.1109/BIBM.2011.84
Chinese Library Classification
TP39 [Computer applications];
Discipline classification code
081203 ; 0835 ;
Abstract
One major problem faced when analyzing DNA microarrays is their high dimensionality (large number of features). Therefore, feature selection is a necessary step when using these datasets. However, the addition or removal of instances can alter the subsets chosen by a feature selection technique. The ideal situation is to choose a feature selection technique that is robust (stable) to changes in the number of instances, with selected features changing little even when instances are added or removed. In this study we test the stability of nineteen feature selection techniques across twenty-six datasets with varying levels of class imbalance. Our results show that the best choice of technique depends on the class balance of the datasets. The top performers are Deviance for balanced datasets, Signal to Noise for slightly imbalanced datasets, and AUC for imbalanced datasets. SVM-RFE was the least stable feature selection technique across the board, while other poor performers include Gain Ratio, Gini Index, Probability Ratio, and Power. We also found that enough changes to the dataset can make any feature selection technique unstable, and that using more features increases the stability of most feature selection techniques. Most intriguing was our finding that the more imbalanced a dataset is, the more stable the feature subsets built for that dataset will be. Overall, we conclude that stability is an important aspect of feature ranking which must be taken into account when planning a feature selection strategy or when adding or removing instances from a dataset.
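The idea of stability under instance removal can be illustrated with a small sketch. The abstract does not state which stability measure or perturbation protocol the authors used, so everything below is an assumption made for illustration: the ranker is a common textbook form of a signal-to-noise score, the perturbation is random removal of a fraction of instances, and stability is taken as the mean pairwise Jaccard overlap of the top-k feature subsets. The function names, the toy data generator, and all parameter values are hypothetical.

```python
import random
import statistics
from itertools import combinations


def signal_to_noise(values, labels):
    """Assumed score for one feature: |mean1 - mean0| / (std1 + std0)."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    spread = statistics.pstdev(pos) + statistics.pstdev(neg)
    # Guard against a zero denominator on constant features.
    return abs(statistics.fmean(pos) - statistics.fmean(neg)) / (spread or 1e-12)


def top_k_features(X, y, k):
    """Rank all features by score and return the indices of the k best as a set."""
    n_features = len(X[0])
    scores = [signal_to_noise([row[j] for row in X], y) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return set(ranked[:k])


def jaccard(a, b):
    """Overlap of two non-empty feature subsets: |a & b| / |a | b|."""
    return len(a & b) / len(a | b)


def stability(X, y, k, keep_fraction=0.9, runs=10, seed=0):
    """Mean pairwise Jaccard overlap of top-k subsets over random subsamples.

    Each run drops a random (1 - keep_fraction) share of the instances,
    re-ranks the features, and keeps the top k. This is an illustrative
    protocol, not the one used in the paper.
    """
    rng = random.Random(seed)
    n = len(X)
    subsets = []
    for _ in range(runs):
        kept = rng.sample(range(n), int(keep_fraction * n))
        subsets.append(top_k_features([X[i] for i in kept], [y[i] for i in kept], k))
    pairs = list(combinations(subsets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def make_toy_data(n=40, n_features=50, n_informative=3, seed=1):
    """Synthetic stand-in for a microarray: few instances, many features.

    Only the first n_informative features differ between the two classes.
    """
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]
    X = [[rng.gauss(4.0 * label if j < n_informative else 0.0, 1.0)
          for j in range(n_features)] for label in y]
    return X, y


X, y = make_toy_data()
for k in (3, 10, 25):
    print(k, round(stability(X, y, k), 3))
```

A score near 1 means the selected subset barely changes when instances are removed; a score near 0 means the subsets share almost no features. Varying `k` and `keep_fraction` in this sketch is one way to probe the abstract's observations about subset size and the amount of change to the dataset, though the toy data says nothing about the paper's actual results.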
Pages: 252-256
Number of pages: 5