Similarity of feature selection methods: An empirical study across data intensive classification tasks

Cited by: 59
Authors
Dessi, Nicoletta [1 ]
Pes, Barbara [1 ]
Affiliation
[1] Univ Cagliari, Dipartimento Matemat & Informat, I-09124 Cagliari, Italy
Keywords
Data mining; Knowledge discovery; Feature selection; Similarity measures; Gene selection; Feature extraction; Prediction; Cancer; Algorithms; Reduction; System; Tumor
DOI
10.1016/j.eswa.2015.01.069
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In the past two decades, the dimensionality of datasets involved in machine learning and data mining applications has increased explosively. Feature selection has therefore become a necessary step to make the analysis more manageable and to extract useful knowledge about a given domain. A large variety of feature selection techniques are available in the literature, and their comparative analysis is a very difficult task. So far, few studies have investigated, from a theoretical and/or experimental point of view, the degree of similarity/dissimilarity among the available techniques, namely the extent to which they tend to produce similar results within specific application contexts. This kind of similarity analysis is of crucial importance when two or more methods are combined in an ensemble fashion: the ensemble paradigm is beneficial only if the involved methods are capable of giving different and complementary representations of the considered domain. This paper contributes in this direction by proposing an empirical approach to evaluate the degree of consistency among the outputs of different selection algorithms in the context of high-dimensional classification tasks. Leveraging a suitable similarity index, we systematically compared the feature subsets selected by eight popular selection methods, representative of different selection approaches, and derived a similarity trend for feature subsets of increasing size. Through extensive experimentation involving sixteen datasets from three challenging domains (Internet advertisements, text categorization and micro-array data classification), we obtained useful insight into the pattern of agreement of the considered methods. In particular, our results revealed that multivariate selection approaches systematically produce feature subsets that overlap only to a small extent with those selected by the other methods. (C) 2015 Elsevier Ltd. All rights reserved.
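As an illustration of the kind of comparison the abstract describes, the following minimal sketch computes a pairwise similarity trend between the top-k feature subsets produced by different selection methods, for increasing k. The abstract does not name the similarity index the authors used, so the Jaccard index here is a stand-in assumption; the function names, method names, and toy rankings are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two feature subsets (assumed stand-in index)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty subsets are treated as identical
    return len(a & b) / len(a | b)


def similarity_trend(rankings, sizes):
    """Compare the top-k subsets of every pair of methods.

    rankings: dict mapping a method name to a list of feature ids,
              ordered from most to least relevant (hypothetical input).
    sizes:    subset sizes k at which to compare the methods.
    Returns a dict mapping each method pair to its list of similarities.
    """
    trend = {}
    for m1, m2 in combinations(sorted(rankings), 2):
        trend[(m1, m2)] = [
            jaccard(rankings[m1][:k], rankings[m2][:k]) for k in sizes
        ]
    return trend


# Toy rankings over six features; illustrative only, not data from the paper.
rankings = {
    "method_a": [3, 1, 4, 0, 2, 5],
    "method_b": [3, 4, 1, 2, 0, 5],
    "method_c": [5, 2, 0, 4, 1, 3],
}
trend = similarity_trend(rankings, sizes=[2, 4, 6])

for pair, values in trend.items():
    print(pair, [round(v, 3) for v in values])
```

In this toy setup, a pair whose curve stays low for small k selects largely different features, which is the kind of complementarity the abstract identifies as useful for ensembles.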
Pages: 4632-4642
Page count: 11
Related papers
69 items in total
  • [1] Robust biomarker identification for cancer diagnosis with ensemble feature selection methods
    Abeel, Thomas
    Helleputte, Thibault
    Van de Peer, Yves
    Dupont, Pierre
    Saeys, Yvan
    [J]. BIOINFORMATICS, 2010, 26 (03) : 392 - 398
  • [2] A comparative study of feature selection and classification methods for gene expression data of glioma
    Abusamra, Heba
    [J]. 4TH INTERNATIONAL CONFERENCE ON COMPUTATIONAL SYSTEMS-BIOLOGY AND BIOINFORMATICS (CSBIO2013), 2013, 23 : 5 - 14
  • [3] Akerkar R., 2009, Knowledge-based systems
  • [4] Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays
    Alon, U
    Barkai, N
    Notterman, DA
    Gish, K
    Ybarra, S
    Mack, D
    Levine, AJ
    [J]. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 1999, 96 (12) : 6745 - 6750
  • [5] Altidor W., 2011, Proceedings of the 24th International Florida Artificial Intelligence Research Society Conference, P453
  • [6] Altidor W, 2011, HANDBOOK OF DATA INTENSIVE COMPUTING, P349, DOI 10.1007/978-1-4614-1415-5_13
  • [7] [Anonymous], 2003, P ACM S APPL COMP
  • [8] [Anonymous], 2000, Pattern Classification
  • [9] Gene-expression profiles predict survival of patients with lung adenocarcinoma
    Beer, DG
    Kardia, SLR
    Huang, CC
    Giordano, TJ
    Levin, AM
    Misek, DE
    Lin, L
    Chen, GA
    Gharib, TG
    Thomas, DG
    Lizyness, ML
    Kuick, R
    Hayasaka, S
    Taylor, JMG
    Iannettoni, MD
    Orringer, MB
    Hanash, S
    [J]. NATURE MEDICINE, 2002, 8 (08) : 816 - 824
  • [10] Hybrid dimension reduction by integrating feature selection with feature extraction method for text clustering
    Bharti, Kusum Kumari
    Singh, Pramod Kumar
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2015, 42 (06) : 3105 - 3114