Classification with correlated features: unreliability of feature ranking and solutions

Cited by: 305
Authors
Tolosi, Laura [1]
Lengauer, Thomas [1]
Affiliations
[1] Max Planck Inst Informat, Dept Computat Biol & Appl Algorithm, Saarbrucken, Germany
Keywords
GROUP LASSO; GENE; SELECTION; MODELS
DOI
10.1093/bioinformatics/btr300
Chinese Library Classification number
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
Motivation: Classification and feature selection of genomics or transcriptomics data are often hampered by the large number of features as compared with the small number of samples available. Moreover, features represented by probes that either have similar molecular functions (gene expression analysis) or genomic locations (DNA copy number analysis) are highly correlated. Classical model selection methods such as penalized logistic regression or random forest become unstable in the presence of high feature correlations. Sophisticated penalties such as group Lasso or fused Lasso can force the models to assign similar weights to correlated features and thus improve model stability and interpretability. In this article, we show that the measures of feature relevance corresponding to the above-mentioned methods are biased such that the weights of the features belonging to groups of correlated features decrease as the sizes of the groups increase, which leads to incorrect model interpretation and misleading feature ranking.
Results: With simulation experiments, we demonstrate that Lasso logistic regression, fused support vector machine, group Lasso and random forest models suffer from correlation bias. Using simulations, we show that two related methods for group selection based on feature clustering can be used for correcting the correlation bias. These techniques also improve the stability and the accuracy of the baseline models. We apply all methods investigated to a breast cancer and a bladder cancer arrayCGH dataset in order to identify copy number aberrations predictive of tumor phenotype.
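The correlation bias described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' experiment: it swaps in plain ridge regression (instead of Lasso logistic regression, group Lasso, fused SVM or random forest) because ridge has a closed-form solution that fits in a few lines of standard-library Python. The data model, the function names `ridge_weights` and `simulate`, and all parameter values are assumptions made for illustration. One latent signal is copied into a group of noisy, highly correlated features, and a single independent feature carries an equally strong signal.

```python
import random

def ridge_weights(X, y, lam):
    """Solve (X^T X + lam*I) w = X^T y by Gaussian elimination."""
    n, p = len(X), len(X[0])
    # Augmented matrix [X^T X + lam*I | X^T y]
    A = [[sum(X[i][a] * X[i][b] for i in range(n)) + (lam if a == b else 0.0)
          for b in range(p)] + [sum(X[i][a] * y[i] for i in range(n))]
         for a in range(p)]
    for col in range(p):
        # Partial pivoting, then eliminate entries below the pivot
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for c in range(col, p + 1):
                A[r][c] -= f * A[col][c]
    # Back substitution
    w = [0.0] * p
    for r in range(p - 1, -1, -1):
        w[r] = (A[r][p] - sum(A[r][c] * w[c] for c in range(r + 1, p))) / A[r][r]
    return w

def simulate(group_size, n=200, lam=1.0, seed=0):
    """Return (mean weight inside the correlated group, weight of the
    independent feature) for a hypothetical toy data model."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)  # latent signal shared by the whole group
        u = rng.gauss(0, 1)  # independent, equally relevant feature
        group = [z + 0.3 * rng.gauss(0, 1) for _ in range(group_size)]
        X.append(group + [u])
        y.append(z + u + 0.1 * rng.gauss(0, 1))
    w = ridge_weights(X, y, lam)
    return sum(w[:group_size]) / group_size, w[-1]

for k in (1, 4, 8):
    group_w, indep_w = simulate(k)
    print(f"group size {k}: mean group weight {group_w:.3f}, "
          f"independent weight {indep_w:.3f}")
```

Under these assumptions the latent signal and the independent feature contribute equally to the response, yet the per-feature weight inside the group shrinks roughly in proportion to the group size while the independent feature keeps its weight. A ranking by weight would therefore place the group members below the independent feature, which is the kind of misleading ranking the abstract warns about.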
Pages: 1986-1994
Number of pages: 9