Comparative Analysis on the Stability of Feature Selection Techniques using Three Frameworks on Biological Datasets

Cited: 0
Authors
Wald, Randall [1 ]
Khoshgoftaar, Taghi [1 ]
Abu Shanab, Ahmad [1 ]
Napolitano, Amri [1 ]
Affiliation
[1] Florida Atlantic Univ, Boca Raton, FL 33431 USA
Source
2013 12TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA 2013), VOL 1 | 2013
Keywords
Feature Selection; Stability; Noise Injection; Imbalanced Data; CLASSIFICATION;
DOI
10.1109/ICMLA.2013.85
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Feature (gene) selection is a common preprocessing technique used to counter the problem of high dimensionality (too many independent features) found in many bioinformatics datasets, addressing this problem by creating a smaller feature subset including only the most important features. Although feature selection techniques are often evaluated based on how they can help improve classification performance, it is also important to find stable feature selection techniques which will give consistent results even in the face of dataset perturbations (such as class noise or sampling used to alleviate the problem of imbalanced data). This is especially important in bioinformatics, where the prime concern may be gene discovery rather than classification. In this study we use three frameworks to evaluate the stability of gene selection techniques: "sampled-clean vs. sampled-clean," "sampled-noisy vs. sampled-noisy," and "sampled-clean vs. sampled-noisy." All frameworks involve pairwise comparisons among the results from the perturbed datasets (due to sampling or class noise injection followed by sampling). They differ in terms of whether they observe how sampling can create variation within the feature subsets (sampled-clean vs. sampled-clean), how noisy datasets (which were then sampled) can create a wide spread of selected features (sampled-noisy vs. sampled-noisy), or how features selected on clean and noisy datasets differ, after both datasets have been sampled (sampled-clean vs. sampled-noisy). Along with these three frameworks, our comparison of seven feature ranking techniques uses four cancer gene datasets, applies three sampling techniques, and generates artificial class noise to better simulate real-world datasets.
The results from the frameworks are generally similar, with Signal-to-Noise and ReliefF showing the best stability and Gain Ratio showing the worst across all three frameworks, although Relief-W is notable for showing moderate to above-average stability when the clean datasets are used, but giving the second worst performance when noise was present.
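The abstract describes all three frameworks as pairwise comparisons among feature subsets selected from perturbed versions of a dataset. As an illustration only, the sketch below scores a set of selected-feature subsets by their average pairwise Jaccard similarity; this is a generic subset-stability measure, and the specific consistency index the paper actually uses is not stated in the abstract.

```python
from itertools import combinations

def pairwise_stability(subsets):
    """Average pairwise Jaccard similarity among feature subsets.

    A generic stability score in [0, 1]: 1.0 means every run selected
    the same features, values near 0 mean the subsets rarely overlap.
    """
    pairs = list(combinations(subsets, 2))
    if not pairs:  # fewer than two subsets: trivially stable
        return 1.0
    sims = [len(a & b) / len(a | b) for a, b in pairs]
    return sum(sims) / len(sims)

# "sampled-clean vs. sampled-clean": compare the subsets a ranker
# selected on different sampled versions of the clean dataset.
# (Gene names here are placeholders, not from the paper.)
clean_subsets = [{"g1", "g2", "g3"}, {"g1", "g2", "g4"}, {"g1", "g3", "g4"}]
print(pairwise_stability(clean_subsets))  # 0.5: each pair shares 2 of 4 genes
```

The other two frameworks follow the same pattern, comparing sampled-noisy subsets against each other, or sampled-clean subsets against sampled-noisy ones.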
Pages: 418-423
Page count: 6