Feature selection for high-dimensional class-imbalanced data sets using Support Vector Machines

Cited by: 233
Authors
Maldonado, Sebastian [1]
Weber, Richard [2]
Famili, Fazel [3]
Affiliations
[1] Univ Los Andes, Santiago, Chile
[2] Univ Chile, Dept Ind Engn, Santiago, Chile
[3] Natl Res Council Canada, Ottawa, ON, Canada
Keywords
Feature selection; Imbalanced data set; Dimensionality reduction; Support Vector Machine; Data mining; GENE SELECTION; CLASSIFICATION; CARCINOMAS; SURVIVAL;
DOI
10.1016/j.ins.2014.07.015
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Feature selection and classification of imbalanced data sets are two of the most interesting machine learning challenges, attracting growing attention from both industry and academia. Feature selection addresses the dimensionality reduction problem by determining a subset of the available features with which to build a good model for classification or prediction, while the class-imbalance problem arises when the class distribution is highly skewed. Both issues have been studied independently in the literature, and a plethora of methods addressing high dimensionality as well as class imbalance have been proposed. The aim of this work is to explore both issues simultaneously, proposing a family of methods that select those attributes that are relevant for identifying the target class in binary classification. We propose a backward elimination approach based on successive holdout steps, whose contribution measure is based on a balanced loss function evaluated on an independent subset. Our experiments use six highly imbalanced microarray data sets and compare our methods with well-known feature selection techniques; the proposed methods achieve better predictions with consistently fewer relevant features. (C) 2014 Elsevier Inc. All rights reserved.
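
The backward elimination idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the loss functions, the SVM solver, or the split sizes, so everything below is an assumption. The function names (`balanced_error`, `train_linear_svm`, `backward_elimination`), the 30% holdout fraction, and the step-size schedule are hypothetical, the balanced error rate stands in for the paper's balanced loss function, and a small subgradient-descent linear SVM stands in for a production SVM solver.

```python
import random

def balanced_error(y_true, y_pred):
    """Mean of the per-class error rates; labels are +1 / -1 (assumed stand-in loss)."""
    errs = []
    for cls in (1, -1):
        idx = [i for i, lab in enumerate(y_true) if lab == cls]
        if idx:
            errs.append(sum(y_pred[i] != cls for i in idx) / len(idx))
    return sum(errs) / len(errs)

def train_linear_svm(X, y, feats, epochs=50, lam=0.01, seed=0):
    """Subgradient descent on the L2-regularized hinge loss, using only `feats`."""
    rng = random.Random(seed)
    w = {j: 0.0 for j in feats}
    b, t = 0.0, 0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 0.1 / (1.0 + 0.01 * t)  # decaying step size (assumption)
            margin = y[i] * (sum(w[j] * X[i][j] for j in feats) + b)
            for j in feats:
                w[j] *= 1.0 - eta * lam   # shrinkage from the L2 penalty
            if margin < 1:                # hinge loss is active for this sample
                for j in feats:
                    w[j] += eta * y[i] * X[i][j]
                b += eta * y[i]
    return w, b

def backward_elimination(X, y, n_keep, holdout_frac=0.3, seed=0):
    """Drop one feature per step, judged by balanced error on a fresh holdout set."""
    rng = random.Random(seed)
    feats = list(range(len(X[0])))
    while len(feats) > n_keep:
        # Draw a new stratified train/holdout split at every elimination step.
        train_idx, hold_idx = [], []
        for cls in (1, -1):
            idx = [i for i, lab in enumerate(y) if lab == cls]
            rng.shuffle(idx)
            cut = max(1, int(len(idx) * holdout_frac))
            hold_idx += idx[:cut]
            train_idx += idx[cut:]
        w, b = train_linear_svm([X[i] for i in train_idx],
                                [y[i] for i in train_idx],
                                feats, seed=rng.randrange(10**6))
        y_hold = [y[i] for i in hold_idx]

        def loss_without(j):
            # Holdout loss of the trained model when feature j is ignored (no retraining).
            preds = [1 if sum(w[k] * X[i][k] for k in feats if k != j) + b >= 0 else -1
                     for i in hold_idx]
            return balanced_error(y_hold, preds)

        # Remove the feature whose absence hurts the balanced holdout loss the least.
        feats.remove(min(feats, key=loss_without))
    return feats

if __name__ == "__main__":
    # Toy imbalanced data: 6 positives, 24 negatives; only feature 0 is informative.
    gen = random.Random(42)
    X = [[(2.0 if i < 6 else -2.0) + gen.gauss(0, 0.5)] + [gen.gauss(0, 1) for _ in range(4)]
         for i in range(30)]
    y = [1 if i < 6 else -1 for i in range(30)]
    print("kept features:", backward_elimination(X, y, n_keep=2))
```

Averaging the error rates of the two classes, rather than counting raw misclassifications, keeps the majority class from dominating the elimination criterion, which is the motivation the abstract gives for a balanced loss.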
Pages: 228-246
Page count: 19