Stable variable ranking and selection in regularized logistic regression for severely imbalanced big binary data

Cited by: 2
Authors
Nadeem, Khurram [1]
Jabri, Mehdi-Abderrahman [1]
Affiliations
[1] Univ Guelph, Guelph, ON, Canada
Keywords
MODELS;
DOI
10.1371/journal.pone.0280258
Chinese Library Classification (CLC)
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];
Discipline codes
07; 0710; 09;
Abstract
We develop a novel covariate ranking and selection algorithm for regularized ordinary logistic regression (OLR) models in the presence of severe class-imbalance in high dimensional datasets with correlated signal and noise covariates. Class-imbalance is resolved using response-based subsampling which we also employ to achieve stability in variable selection by creating an ensemble of regularized OLR models fitted to subsampled (and balanced) datasets. The regularization methods considered in our study include Lasso, adaptive Lasso (adaLasso) and ridge regression. Our methodology is versatile in the sense that it works effectively for regularization techniques involving both hard- (e.g. Lasso) and soft-shrinkage (e.g. ridge) of the regression coefficients. We assess selection performance by conducting a detailed simulation experiment involving varying moderate-to-severe class-imbalance ratios and highly correlated continuous and discrete signal and noise covariates. Simulation results show that our algorithm is robust against severe class-imbalance under the presence of highly correlated covariates, and consistently achieves stable and accurate variable selection with very low false discovery rate. We illustrate our methodology using a case study involving a severely imbalanced high-dimensional wildland fire occurrence dataset comprising 13 million instances. The case study and simulation results demonstrate that our framework provides a robust approach to variable selection in severely imbalanced big binary data.
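The subsampling-and-ensemble idea described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the ensemble size `B`, the penalty strength `C`, the 0.8 stability threshold, and the toy data-generating model are all assumed values chosen for demonstration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy imbalanced data: covariates 0 and 1 carry signal, the rest are noise.
n, p = 2000, 10
X = rng.normal(size=(n, p))
logit = 2.0 * X[:, 0] - 2.0 * X[:, 1] - 4.0   # negative intercept -> class 1 is rare
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

pos = np.flatnonzero(y == 1)   # minority-class indices
neg = np.flatnonzero(y == 0)   # majority-class indices

B = 50                  # ensemble size (assumed)
freq = np.zeros(p)      # how often each covariate survives the Lasso penalty
for _ in range(B):
    # Response-based subsampling: keep every minority case and
    # downsample the majority class to the same size -> a balanced subsample.
    idx = np.concatenate([pos, rng.choice(neg, size=pos.size, replace=False)])
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    clf.fit(X[idx], y[idx])
    freq += np.abs(clf.coef_.ravel()) > 1e-8   # count nonzero coefficients

ranking = np.argsort(-freq)                   # covariates ranked by selection stability
selected = np.flatnonzero(freq / B >= 0.8)    # stability threshold (assumed)
print("selection frequencies:", (freq / B).round(2))
print("selected covariates:", selected)
```

Ranking by selection frequency across the balanced subsamples is what stabilizes the otherwise replicate-to-replicate variability of a single Lasso fit; for soft-shrinkage penalties such as ridge, a magnitude-based ranking of the averaged coefficients would play the analogous role.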
Pages: 26