Comparative Analysis on the Stability of Feature Selection Techniques using Three Frameworks on Biological Datasets

Cited: 0
Authors
Wald, Randall [1 ]
Khoshgoftaar, Taghi [1 ]
Abu Shanab, Ahmad [1 ]
Napolitano, Amri [1 ]
Affiliation
[1] Florida Atlantic Univ, Boca Raton, FL 33431 USA
Source
2013 12TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA 2013), VOL 1 | 2013
Keywords
Feature Selection; Stability; Noise Injection; Imbalanced Data; CLASSIFICATION;
DOI
10.1109/ICMLA.2013.85
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Feature (gene) selection is a common preprocessing technique used to counter the problem of high dimensionality (too many independent features) found in many bioinformatics datasets, addressing this problem by creating a smaller feature subset including only the most important features. Although feature selection techniques are often evaluated based on how they can help improve classification performance, it is also important to find stable feature selection techniques which will give consistent results even in the face of dataset perturbations (such as class noise or sampling used to alleviate the problem of imbalanced data). This is especially important in bioinformatics, where the prime concern may be gene discovery rather than classification. In this study we use three frameworks to evaluate the stability of gene selection techniques: "sampled-clean vs. sampled-clean," "sampled-noisy vs. sampled-noisy," and "sampled-clean vs. sampled-noisy." All frameworks involve pairwise comparisons among the results from the perturbed datasets (due to sampling or class noise injection followed by sampling). They differ in terms of whether they observe how sampling can create variation within the feature subsets (sampled-clean vs. sampled-clean), how noisy datasets (which were then sampled) can create a wide spread of selected features (sampled-noisy vs. sampled-noisy), or how features selected on clean and noisy datasets differ, after both datasets have been sampled (sampled-clean vs. sampled-noisy). Along with these three frameworks, our comparison of seven feature ranking techniques uses four cancer gene datasets, applies three sampling techniques, and generates artificial class noise to better simulate real-world datasets.
The results from the frameworks are generally similar, with Signal-to-Noise and ReliefF showing the best stability and Gain Ratio showing the worst across all three frameworks, although Relief-W is notable for showing moderate to above-average stability when the clean datasets are used, but giving the second worst performance when noise was present.
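The abstract describes all three frameworks as pairwise comparisons among feature subsets selected from perturbed versions of a dataset. As an illustration only, the sketch below scores a set of selected-feature subsets by their average pairwise Jaccard similarity; this is a generic subset-stability measure, and the specific consistency index the paper actually uses is not stated in the abstract.

```python
from itertools import combinations

def pairwise_stability(subsets):
    """Average pairwise Jaccard similarity among feature subsets.

    A generic stability score in [0, 1]: 1.0 means every run selected
    the same features, values near 0 mean the subsets rarely overlap.
    """
    pairs = list(combinations(subsets, 2))
    if not pairs:  # fewer than two subsets: trivially stable
        return 1.0
    sims = [len(a & b) / len(a | b) for a, b in pairs]
    return sum(sims) / len(sims)

# "sampled-clean vs. sampled-clean": compare the subsets a ranker
# selected on different sampled versions of the clean dataset.
# (Gene names here are placeholders, not from the paper.)
clean_subsets = [{"g1", "g2", "g3"}, {"g1", "g2", "g4"}, {"g1", "g3", "g4"}]
print(pairwise_stability(clean_subsets))  # 0.5: each pair shares 2 of 4 genes
```

The other two frameworks follow the same pattern, comparing sampled-noisy subsets against each other, or sampled-clean subsets against sampled-noisy ones.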
Pages: 418-423
Page count: 6