Sparse regression for large data sets with outliers

Cited by: 19
Authors
Bottmer, Lea [1, 3]
Croux, Christophe [2]
Wilms, Ines [3]
Affiliations
[1] Stanford Univ, Dept Econ, Stanford, CA 94305 USA
[2] EDHEC Business Sch, Paris, France
[3] Maastricht Univ, Dept Quantitat Econ, Maastricht, Netherlands
Keywords
Data science; Lasso; Outliers; Robust regression; Variable selection; HIGH-DIMENSIONAL DATA; SELECTION; ROBUST; REGULARIZATION; SALES; INFORMATION; MODELS
DOI
10.1016/j.ejor.2021.05.049
Chinese Library Classification (CLC) number
C93 [Management science]
Subject classification codes
12; 1201; 1202; 120202
Abstract
The linear regression model remains an important workhorse for data scientists. However, many data sets contain many more predictors than observations. Besides, outliers, or anomalies, frequently occur. This paper proposes an algorithm for regression analysis, which we call "sparse shooting S", that addresses these features typical of big data sets. The resulting regression coefficients are sparse, meaning that many of them are set to zero, thereby selecting the most relevant predictors. A distinct feature of the method is its robustness with respect to outliers in the cells of the data matrix. The excellent performance of this robust variable selection and prediction method is shown in a simulation study. A real data application on car fuel consumption demonstrates its usefulness. © 2021 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
Pages: 782-794
Page count: 13
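
The abstract's statement that sparse coefficients are "set to zero" can be illustrated with a small sketch. The code below is not the authors' sparse shooting S algorithm and has no robustness to outliers; it is a plain, non-robust lasso solved by coordinate-wise updates, an approach commonly called "shooting". The function names `soft_threshold` and `shooting_lasso`, the objective scaling (1/(2n)) * ||y - X b||^2 + lam * ||b||_1, and the toy data are all assumptions made for illustration.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero; return exactly 0.0 when |value| <= threshold."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def shooting_lasso(X, y, lam, n_sweeps=200):
    """Coordinate-descent lasso (illustrative, non-robust).

    Minimizes (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by updating
    one coefficient at a time while holding the others fixed.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X beta, since beta starts at zero
    for _ in range(n_sweeps):
        for j in range(p):
            col = [X[i][j] for i in range(n)]
            z = sum(v * v for v in col) / n
            if z == 0.0:
                continue  # an all-zero predictor carries no information
            # Correlation of predictor j with the residual that ignores
            # predictor j's own current contribution.
            rho = sum(col[i] * (residual[i] + col[i] * beta[j])
                      for i in range(n)) / n
            new_beta = soft_threshold(rho, lam) / z
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= col[i] * delta
                beta[j] = new_beta
    return beta


# Toy data (hypothetical): three orthogonal predictors; the response
# depends strongly on the first and only weakly on the second.
X = [[1.0, 1.0, 1.0],
     [-1.0, 1.0, -1.0],
     [1.0, -1.0, -1.0],
     [-1.0, -1.0, 1.0]]
y = [2.1, -1.9, 1.9, -2.1]

beta = shooting_lasso(X, y, lam=0.5)
print(beta)
```

On this toy design the penalty shrinks the first coefficient from 2 to about 1.5 and sets the other two exactly to zero, which is the variable-selection effect the abstract describes. A robust variant would replace the squared-error loss with a loss that downweights outlying cells, which this sketch does not attempt.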