Gradient boosting for high-dimensional prediction of rare events

Cited by: 49
Authors
Blagus, Rok [1]
Lusa, Lara [1]
Affiliations
[1] Univ Ljubljana, Inst Biostat & Med Informat, Fac Med, Vrazov Trg 2, Ljubljana 1000, Slovenia
Keywords
Gradient boosting; Rare events bias; Regularization through shrinkage and subsampling; Ensemble classifiers; High-dimensional class-prediction; METASTASIS; GENOMICS
DOI
10.1016/j.csda.2016.07.016
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
In clinical research the goal is often to correctly estimate the probability of an event. For this purpose several characteristics of the patients are measured and used to develop a prediction model, which can then be used to predict the class membership of future patients. Ensemble classifiers are combinations of many different classifiers, and they can be useful because combining a set of classifiers can result in more accurate predictions. Gradient boosting is an ensemble classifier that was shown to perform well in the setting where the number of variables exceeds the number of samples (high-dimensional data); however, it has not been evaluated for the prediction of rare events. It is demonstrated that gradient boosting suffers from severe rare events bias, correctly classifying only a small proportion of samples from the rare class. The bias can be removed by using subsampling in combination with an appropriate amount of shrinkage, but only for a specific number of boosting iterations and for the binomial loss function. It is shown that the number of boosting iterations at which the rare events bias is removed cannot be estimated efficiently from the training data when the sample size is small. Therefore several corrections for the rare events bias of gradient boosting are proposed and evaluated by using simulated and real high-dimensional data. It is demonstrated that the proposed corrections successfully remove the rare events bias and outperform the other ensemble classifiers that were considered. The large flexibility and high interpretability of the proposed methods are also illustrated. © 2016 Elsevier B.V. All rights reserved.
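The ingredients named in the abstract (binomial loss, shrinkage, subsampling, and the number of boosting iterations) can be made concrete with a small sketch. The code below is an illustration only and is not the authors' implementation or any of their proposed corrections. It is a toy stochastic gradient boosting with regression stumps in plain Python, under several assumptions that are not in the record: the function names (`boost`, `fit_stump`, `class_accuracies`, `make_data`), the simulated Gaussian data with a 10% rare class, the low-dimensional setting, and the use of mean-residual leaf values instead of Newton-step leaf estimates. It tracks class-specific accuracy over iterations, which is the quantity affected by the rare events bias.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_stump(X, residuals, rows):
    """Least-squares regression stump fitted to the residuals on `rows`.
    Returns (feature, threshold, left_value, right_value)."""
    best = None
    for j in range(len(X[0])):
        values = sorted({X[i][j] for i in rows})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [residuals[i] for i in rows if X[i][j] <= thr]
            right = [residuals[i] for i in rows if X[i][j] > thr]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((r - lm) ** 2 for r in left)
                   + sum((r - rm) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, thr, lm, rm)
    if best is None:  # no feature varies on these rows: constant fit
        mean = sum(residuals[i] for i in rows) / len(rows)
        return 0, float("inf"), mean, mean
    return best[1:]


def boost(X, y, n_iter=50, shrinkage=0.1, subsample=0.5, seed=0):
    """Toy stochastic gradient boosting with binomial (logistic) loss.
    Returns the training-set scores F (log-odds) after each iteration."""
    rng = random.Random(seed)
    n = len(y)
    p0 = sum(y) / n
    F = [math.log(p0 / (1.0 - p0))] * n  # start from the overall log-odds
    history = []
    for _ in range(n_iter):
        # Negative gradient of the binomial loss with respect to F.
        residuals = [y[i] - sigmoid(F[i]) for i in range(n)]
        # Subsampling: fit each stump on a random fraction of the rows.
        rows = rng.sample(range(n), max(2, int(subsample * n)))
        j, thr, lv, rv = fit_stump(X, residuals, rows)
        # Shrinkage: take only a small step along the fitted stump.
        F = [F[i] + shrinkage * (lv if X[i][j] <= thr else rv)
             for i in range(n)]
        history.append(F)
    return history


def class_accuracies(F, y, cutoff=0.5):
    """Fraction correctly classified within each class (rare class = 1)."""
    pred = [1 if sigmoid(f) > cutoff else 0 for f in F]
    acc = {}
    for c in (0, 1):
        idx = [i for i in range(len(y)) if y[i] == c]
        acc[c] = sum(pred[i] == c for i in idx) / len(idx)
    return acc


def make_data(n=200, p=5, rare_fraction=0.1, shift=1.0, seed=1):
    """Hypothetical toy data: only feature 0 differs between classes."""
    rng = random.Random(seed)
    n_rare = int(rare_fraction * n)
    y = [1] * n_rare + [0] * (n - n_rare)
    X = [[rng.gauss(shift * yi if j == 0 else 0.0, 1.0) for j in range(p)]
         for yi in y]
    return X, y


if __name__ == "__main__":
    X, y = make_data()
    history = boost(X, y)
    for m in (1, 10, 50):
        print(m, class_accuracies(history[m - 1], y))
```

Because the initial score is the log-odds of the rare class and each step is shrunk, the first iterations of this sketch assign every sample to the majority class, so accuracy on the rare class starts at zero. How the class-specific accuracies evolve afterwards depends on `shrinkage`, `subsample`, and `n_iter`, which is the sensitivity the abstract describes.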
Pages: 19-37
Number of pages: 19