Robust linear model selection based on least angle regression

Cited by: 117
Authors
Khan, Jafar A. [1 ]
Van Aelst, Stefan [2 ]
Zamar, Ruben H. [3 ]
Affiliations
[1] Univ Dhaka, Dept Stat, Dhaka 1000, Bangladesh
[2] Univ Ghent, Dept Appl Math & Comp Sci, B-9000 Ghent, Belgium
[3] Univ British Columbia, Dept Stat, Vancouver, BC V6T 1Z2, Canada
Funding
Natural Sciences and Engineering Research Council of Canada;
Keywords
bootstrap; computational complexity; robust prediction; stepwise algorithm; Winsorization;
DOI
10.1198/016214507000000950
Chinese Library Classification
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics];
Discipline codes
020208; 070103; 0714;
Abstract
In this article we consider the problem of building a linear prediction model when the number of candidate predictors is large and the data possibly contain anomalies that are difficult to visualize and clean. We want to predict the nonoutlying cases; therefore, we need a method that is simultaneously robust and scalable. We consider the stepwise least angle regression (LARS) algorithm, which is computationally very efficient but sensitive to outliers. We introduce two different approaches to robustify LARS. The plug-in approach replaces the classical correlations in LARS by robust correlation estimates. The cleaning approach first transforms the data set by shrinking the outliers toward the bulk of the data (which we call multivariate Winsorization) and then applies LARS to the transformed data. We show that the plug-in approach is time-efficient and scalable and that the bootstrap can be used to stabilize its results. We recommend using bootstrapped robustified LARS to sequence a number of candidate predictors to form a reduced set from which a more refined model can be selected.
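The two ideas in the abstract can be illustrated together: Winsorization shrinks outlying values toward the bulk of the data, and a robust correlation is obtained by computing the ordinary correlation of the Winsorized variables. The sketch below is only a simplified univariate version of this (the paper's multivariate Winsorization is more involved); the function names and the cutoff `c` are hypothetical, with the median and MAD as robust center and scale.

```python
import numpy as np

def univariate_winsorize(x, c=2.0):
    """Shrink values lying more than c robust-scale units from the
    median back to the boundary (univariate Winsorization)."""
    med = np.median(x)
    mad = 1.4826 * np.median(np.abs(x - med))  # consistent at the normal
    return np.clip(x, med - c * mad, med + c * mad)

def winsorized_correlation(x, y, c=2.0):
    """Plug-in style robust correlation: the Pearson correlation of
    the two Winsorized variables."""
    return np.corrcoef(univariate_winsorize(x, c),
                       univariate_winsorize(y, c))[0, 1]
```

A plug-in robustification of LARS in this spirit would substitute such a robust correlation wherever the algorithm evaluates correlations between the response and the candidate predictors.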
Pages: 1289-1299
Number of pages: 11