Sparse data bias: a problem hiding in plain sight

Times cited: 686
Authors
Greenland, Sander [1 ,2 ]
Mansournia, Mohammad Ali [3 ]
Altman, Douglas G. [4 ]
Affiliations
[1] Univ Calif Los Angeles, Dept Epidemiol, Los Angeles, CA USA
[2] Univ Calif Los Angeles, Dept Stat, Los Angeles, CA USA
[3] Univ Tehran Med Sci, Sch Publ Hlth, Dept Epidemiol & Biostat, POB 14155-6446, Tehran, Iran
[4] Univ Oxford, Nuffield Dept Orthopaed Rheumatol & Musculoskelet, Ctr Stat Med, Oxford, England
Source
BMJ-BRITISH MEDICAL JOURNAL | 2016 / Volume 353
Keywords
LOGISTIC-REGRESSION; SELECTION; MODEL; LIKELIHOOD; SIMULATION; REDUCTION; EVENTS; IMPACT; RISK;
DOI
10.1136/bmj.i1981
Chinese Library Classification
R5 [Internal medicine];
Discipline classification code
1002 ; 100201 ;
Abstract
Effects of treatment or other exposure on outcome events are commonly measured by ratios of risks, rates, or odds. Adjusted versions of these measures are usually estimated by maximum likelihood regression (eg, logistic, Poisson, or Cox modelling). But resulting estimates of effect measures can have serious bias when the data lack adequate case numbers for some combination of exposure and outcome levels. This bias can occur even in quite large datasets and is hence often termed sparse data bias. The bias can arise from or be worsened by regression adjustment for potentially confounding variables; in the extreme, the resulting estimates could be impossibly huge or even infinite values that are meaningless artefacts of data sparsity. Such estimate inflation might be obvious in light of background information, but is rarely noted, let alone accounted for, in research reports. We outline simple methods for detecting and dealing with the problem, focusing especially on penalised estimation, which can be easily performed with common software packages.
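As a rough illustration of the abstract's point, the sketch below compares the maximum likelihood odds ratio from a single 2x2 table with a version that adds 0.5 to every cell (the Haldane-Anscombe correction, one simple form of penalisation). The counts, the function name `odds_ratio`, and the choice of correction are assumptions made for this example; they are not taken from the paper and are not necessarily the method it recommends.

```python
import math


def odds_ratio(a, b, c, d, pseudo=0.0):
    """Odds ratio for a 2x2 table, with an optional pseudo-count per cell.

    a: exposed cases      b: exposed non-cases
    c: unexposed cases    d: unexposed non-cases
    pseudo: constant added to every cell (0.0 gives the plain ML estimate;
            0.5 gives the Haldane-Anscombe corrected estimate).
    """
    a, b, c, d = (x + pseudo for x in (a, b, c, d))
    if b == 0 or c == 0:
        # Division by zero: the ML estimate is infinite (or undefined
        # when the numerator is zero as well).
        return math.inf if a > 0 and d > 0 else math.nan
    return (a * d) / (b * c)


# Hypothetical counts: 1000 subjects in total, but only 5 are exposed,
# and every exposed subject is a case, so one cell is empty.
table = dict(a=5, b=0, c=20, d=975)

ml_estimate = odds_ratio(**table)                    # infinite
penalised_estimate = odds_ratio(**table, pseudo=0.5)  # finite

print("ML odds ratio:       ", ml_estimate)
print("Penalised odds ratio:", round(penalised_estimate, 1))
```

The dataset here is not small, yet a single empty cell makes the unpenalised estimate infinite. The corrected estimate is finite but still very large, which suggests that a weak penalty alone may not be enough when the data are this sparse.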
Pages: 6