Large-scale attribute selection using wrappers

Cited by: 166
Authors
Guetlein, Martin [1]
Frank, Eibe [2]
Hall, Mark [3]
Karwath, Andreas [1]
Affiliations
[1] Albert Ludwigs Univ Freiburg, Dept Comp Sci, D-7800 Freiburg, Germany
[2] Univ Waikato, Dept Comp Sci, Hamilton, New Zealand
[3] Pentaho Corp, Orlando, FL USA
Source
2009 IEEE SYMPOSIUM ON COMPUTATIONAL INTELLIGENCE AND DATA MINING | 2009
Keywords
PREDICTION; CANCER
DOI
10.1109/CIDM.2009.4938668
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Scheme-specific attribute selection with the wrapper and variants of forward selection is a popular attribute selection technique for classification that yields good results. However, it can run the risk of overfitting because of the extent of the search and the extensive use of internal cross-validation. Moreover, although wrapper evaluators tend to achieve superior accuracy compared to filters, they incur a high computational cost. The problems of overfitting and high runtime occur in particular on high-dimensional datasets, such as microarray data. We investigate Linear Forward Selection, a technique to reduce the number of attribute expansions in each forward selection step. Our experiments demonstrate that this approach is faster, finds smaller subsets, and can even increase accuracy compared to standard forward selection. We also investigate a variant that applies explicit subset size determination in forward selection to combat overfitting, where the search is forced to stop at a precomputed "optimal" subset size. We show that this technique reduces subset size while maintaining comparable accuracy.
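The idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that the reduction in attribute expansions comes from ranking attributes once by their single-attribute score and restricting the search to the top `k`; the abstract does not specify the mechanism. The names `forward_selection`, `limited_forward_selection`, `toy_evaluate`, `k`, and `max_size` are hypothetical, and `toy_evaluate` stands in for a wrapper evaluator that would normally run internal cross-validation of a classifier.

```python
from typing import Callable, List, Optional, Sequence

Evaluator = Callable[[List[int]], float]


def forward_selection(candidates: Sequence[int], evaluate: Evaluator,
                      max_size: Optional[int] = None) -> List[int]:
    """Greedy forward selection over `candidates`.

    Stops when no single added attribute improves the score, or (assumed
    reading of the "explicit subset size" variant) when `max_size` is reached.
    """
    selected: List[int] = []
    remaining = list(candidates)
    best_score = float("-inf")
    while remaining and (max_size is None or len(selected) < max_size):
        # One "expansion" per remaining attribute in this step.
        step_score, step_attr = max(
            ((evaluate(selected + [a]), a) for a in remaining),
            key=lambda pair: pair[0],
        )
        if step_score <= best_score:
            break  # no improvement: stop the search
        selected.append(step_attr)
        remaining.remove(step_attr)
        best_score = step_score
    return selected


def limited_forward_selection(attributes: Sequence[int], evaluate: Evaluator,
                              k: int, max_size: Optional[int] = None) -> List[int]:
    """Assumed reduction: rank attributes once by their single-attribute
    score, keep the top k, and run forward selection on those only."""
    ranked = sorted(attributes, key=lambda a: evaluate([a]), reverse=True)
    return forward_selection(ranked[:k], evaluate, max_size)


def toy_evaluate(subset: List[int]) -> float:
    """Hypothetical stand-in for a wrapper evaluator: three informative
    attributes, plus a small penalty per selected attribute."""
    useful = {0: 0.3, 3: 0.2, 7: 0.1}
    return sum(useful.get(a, 0.0) for a in subset) - 0.01 * len(subset)


if __name__ == "__main__":
    attrs = list(range(10))
    print(limited_forward_selection(attrs, toy_evaluate, k=4))
    print(limited_forward_selection(attrs, toy_evaluate, k=4, max_size=2))
```

With `k` much smaller than the number of attributes, each step evaluates at most `k` candidate subsets instead of one per remaining attribute, which is where the runtime saving in this sketch comes from.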
Pages: 332-339
Page count: 8