Correcting Classifiers for Sample Selection Bias in Two-Phase Case-Control Studies

Cited by: 8
Authors
Krautenbacher, Norbert [1 ,2 ]
Theis, Fabian J. [1 ,2 ]
Fuchs, Christiane [1 ,2 ]
Affiliations
[1] Helmholtz Zentrum Munchen, German Res Ctr Environm Hlth, Inst Computat Biol, Munich, Germany
[2] Tech Univ Munich, Dept Math, Munich, Germany
Keywords
PREDICTION MODELS; DESIGN; SIZE; MEN; SEX;
DOI
10.1155/2017/7847531
Chinese Library Classification
Q [Biological Sciences];
Discipline Classification Codes
07; 0710; 09;
Abstract
Epidemiological studies often utilize stratified data in which rare outcomes or exposures are artificially enriched. This design can increase precision in association tests but distorts predictions when applying classifiers to nonstratified data. Several methods correct for this so-called sample selection bias, but their performance remains unclear, especially for machine learning classifiers. With an emphasis on two-phase case-control studies, we aim to assess which corrections to perform in which setting and to obtain methods suitable for machine learning techniques, especially the random forest. We propose two new resampling-based methods to resemble the original data and covariance structure: stochastic inverse-probability oversampling and parametric inverse-probability bagging. We compare all techniques for the random forest and other classifiers, both theoretically and on simulated and real data. Empirical results show that the random forest benefits only from our proposed parametric inverse-probability bagging. For other classifiers, correction is mostly advantageous, and the methods perform uniformly. We discuss the consequences of inappropriate distribution assumptions and the reasons for the different behaviors of the random forest and other classifiers. In conclusion, we provide guidance for choosing correction methods when training classifiers on biased samples. For random forests, our method outperforms state-of-the-art procedures if distribution assumptions are roughly fulfilled. We provide our implementation in the R package sambia.
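The inverse-probability bagging idea described in the abstract can be illustrated with a short sketch (not the authors' sambia implementation, which is in R): each bootstrap resample draws observations with probability proportional to their inverse sampling probability, so the resamples mimic the unbiased population before the ensemble is fit. The sampling probabilities (0.9 for cases, 0.3 for controls) and the logistic base learner here are illustrative assumptions only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy biased sample: cases are oversampled relative to the population.
n = 500
X = rng.normal(size=(n, 3))
y = (X[:, 0] + rng.normal(size=n) > 0).astype(int)

# Hypothetical known sampling probabilities per observation
# (cases drawn with prob 0.9, controls with prob 0.3).
samp_prob = np.where(y == 1, 0.9, 0.3)
weights = 1.0 / samp_prob  # inverse-probability weights

def ip_bagging(X, y, weights, n_models=25):
    """Inverse-probability bagging: each bootstrap draws observations
    with probability proportional to the inverse of their sampling
    probability, so resamples resemble the unbiased population."""
    p = weights / weights.sum()
    models = []
    for _ in range(n_models):
        idx = rng.choice(len(y), size=len(y), replace=True, p=p)
        models.append(LogisticRegression().fit(X[idx], y[idx]))
    return models

def predict_proba(models, X):
    # Average the class-1 probabilities over the ensemble.
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)

models = ip_bagging(X, y, weights)
probs = predict_proba(models, X)
```

Any classifier could serve as the base learner; the abstract's point is that for random forests only the parametric variant of this resampling correction paid off.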
Pages: 18