Analysis of feature selection stability on high dimension and small sample data

Cited by: 71
Authors
Dernoncourt, David [1, 2]
Hanczar, Blaise [3]
Zucker, Jean-Daniel [1, 4]
Affiliations
[1] Ctr Rech Cordeliers, Inst Natl Sante & Rech Med, U872, F-75006 Paris, France
[2] Univ Paris 06, F-75006 Paris, France
[3] Univ Paris 05, LIPADE, F-75006 Paris, France
[4] UMMISCO, Inst Rech Dev, IRD, UMI 209, F-93143 Bondy, France
Keywords
Feature selection; Small sample; Stability; Low N/D ratio; CANCER; CLASSIFICATION; MICROARRAYS; PREDICTION; DISCOVERY
DOI
10.1016/j.csda.2013.07.012
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
Feature selection is an important step when building a classifier on high-dimensional data. When the number of observations is small, feature selection tends to be unstable: two feature subsets obtained from different datasets that address the same classification problem often do not overlap significantly. Although this is a crucial problem, little work has been done on selection stability. The behavior of feature selection is analyzed under various conditions, with a focus on (but not limited to) t-score based feature selection approaches and small-sample data. The analysis proceeds in three steps: the first is theoretical and uses a simple mathematical model; the second is empirical and based on artificial data; and the last is based on real data. These three analyses lead to the same results and give a better understanding of the feature selection problem in high-dimensional data. (C) 2013 Elsevier B.V. All rights reserved.
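The instability described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' protocol: the two-class Gaussian model, the parameter values, the use of the Jaccard index as the overlap measure, and all function names (`welch_t`, `make_data`, `select_top_k`, `jaccard`, `mean_stability`) are assumptions chosen for illustration. It selects the top-k features by absolute t-score on two independent datasets drawn from the same distribution and reports how much the two selections overlap.

```python
import random


def welch_t(a, b):
    """Absolute Welch t-score between two sequences of feature values."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    return abs(mean_a - mean_b) / (var_a / len(a) + var_b / len(b)) ** 0.5


def make_data(rng, n_per_class, n_features, n_informative, shift):
    """Draw two Gaussian classes; only the first n_informative features differ in mean."""
    class0 = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
              for _ in range(n_per_class)]
    class1 = [[rng.gauss(shift if j < n_informative else 0.0, 1.0)
               for j in range(n_features)]
              for _ in range(n_per_class)]
    return class0, class1


def select_top_k(class0, class1, k):
    """Indices of the k features with the largest absolute t-score."""
    # zip(*rows) transposes a list of samples into per-feature columns.
    scores = [welch_t(a, b) for a, b in zip(zip(*class0), zip(*class1))]
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return set(ranked[:k])


def jaccard(s1, s2):
    """Overlap of two feature subsets: |intersection| / |union|."""
    return len(s1 & s2) / len(s1 | s2)


def mean_stability(n_per_class, n_features=200, n_informative=10,
                   shift=1.0, k=10, n_pairs=5, seed=0):
    """Average Jaccard overlap between selections made on independent datasets."""
    rng = random.Random(seed)
    overlaps = []
    for _ in range(n_pairs):
        first = select_top_k(
            *make_data(rng, n_per_class, n_features, n_informative, shift), k)
        second = select_top_k(
            *make_data(rng, n_per_class, n_features, n_informative, shift), k)
        overlaps.append(jaccard(first, second))
    return sum(overlaps) / len(overlaps)


if __name__ == "__main__":
    # Illustrative sample sizes only; these are not values from the paper.
    for n in (5, 100):
        print(f"n per class = {n:3d}: mean Jaccard overlap = {mean_stability(n):.2f}")
```

Under these assumed settings, the overlap between the two selected subsets should be noticeably lower for the small sample than for the large one, which is the qualitative effect the abstract describes. Other stability measures could be substituted for the Jaccard index without changing the structure of the experiment.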
Pages: 681-693
Page count: 13