Stability Analysis of Feature Ranking Techniques on Biological Datasets

Cited by: 15
Authors
Dittman, David [1]
Khoshgoftaar, Taghi M. [1]
Wald, Randall [1]
Wang, Huanjing [2]
Affiliations
[1] Florida Atlantic Univ, Boca Raton, FL 33431 USA
[2] Western Kentucky Univ, Bowling Green, KY 42101 USA
Source
2011 IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE (BIBM 2011) | 2011
Keywords
Stability; DNA Microarray; Feature Selection
DOI
10.1109/BIBM.2011.84
Chinese Library Classification code
TP39 [Computer Applications]
Discipline classification code
081203; 0835
Abstract
One major problem faced when analyzing DNA microarrays is their high dimensionality (large number of features). Therefore, feature selection is a necessary step when using these datasets. However, the addition or removal of instances can alter the subsets chosen by a feature selection technique. The ideal situation is to choose a feature selection technique that is robust (stable) to changes in the number of instances, with selected features changing little even when instances are added or removed. In this study we test the stability of nineteen feature selection techniques across twenty-six datasets with varying levels of class imbalance. Our results show that the best choice of technique depends on the class balance of the datasets. The top performers are Deviance for balanced datasets, Signal to Noise for slightly imbalanced datasets, and AUC for imbalanced datasets. SVM-RFE was the least stable feature selection technique across the board, while other poor performers include Gain Ratio, Gini Index, Probability Ratio, and Power. We also found that enough changes to the dataset can make any feature selection technique unstable, and that using more features increases the stability of most feature selection techniques. Most intriguing was our finding that the more imbalanced a dataset is, the more stable the feature subsets built for that dataset will be. Overall, we conclude that stability is an important aspect of feature ranking which must be taken into account when planning a feature selection strategy or when adding or removing instances from a dataset.
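The notion of stability described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' procedure: the abstract does not state which stability measure, ranker definitions, or removal scheme the study used. The sketch assumes a simple two-class signal-to-noise score, assumes stability is measured as the mean fraction of the original top-k features that are still selected after a random share of instances is removed, and uses synthetic data. The names `signal_to_noise`, `top_k_features`, `stability`, and `make_data` are hypothetical.

```python
import random
from statistics import mean, pstdev


def signal_to_noise(values, labels):
    """Absolute difference of class means divided by the sum of class std devs (assumed form)."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    if not pos or not neg:
        return 0.0  # score is undefined when one class is missing
    spread = pstdev(pos) + pstdev(neg)
    return abs(mean(pos) - mean(neg)) / spread if spread > 0 else 0.0


def top_k_features(rows, labels, k):
    """Score every feature column and return the indices of the k best as a set."""
    n_features = len(rows[0])
    scores = [signal_to_noise([r[j] for r in rows], labels) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return set(ranked[:k])


def stability(rows, labels, k, keep_fraction=0.9, repeats=20, seed=0):
    """Mean overlap between the full-data top-k subset and subsets from reduced data."""
    rng = random.Random(seed)
    reference = top_k_features(rows, labels, k)
    n_keep = int(len(rows) * keep_fraction)
    overlaps = []
    for _ in range(repeats):
        # Remove instances at random, keeping the survivors in their original order.
        idx = sorted(rng.sample(range(len(rows)), n_keep))
        subset = top_k_features([rows[i] for i in idx], [labels[i] for i in idx], k)
        overlaps.append(len(reference & subset) / k)
    return mean(overlaps)


def make_data(n_instances=40, n_features=30, n_informative=5, seed=1):
    """Synthetic stand-in for a microarray dataset: a few shifted features, the rest noise."""
    rng = random.Random(seed)
    labels = [i % 2 for i in range(n_instances)]
    rows = []
    for y in labels:
        row = [rng.gauss(3.0 * y if j < n_informative else 0.0, 1.0)
               for j in range(n_features)]
        rows.append(row)
    return rows, labels


rows, labels = make_data()
score = stability(rows, labels, k=5, keep_fraction=0.8)
print(f"mean top-5 overlap after removing 20% of instances: {score:.2f}")
```

A value near 1 means the selected subset barely changes when instances are removed; lowering `keep_fraction` or changing `k` gives a rough way to probe the abstract's observations about dataset changes and subset size on data of one's own.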
Pages: 252-256
Number of pages: 5