Adequate sample size for developing prediction models is not simply related to events per variable

Citations: 338
Authors
Ogundimu, Emmanuel O. [1]
Altman, Douglas G. [1]
Collins, Gary S. [1]
Affiliations
[1] Univ Oxford, Botnar Res Ctr, Nuffield Dept Orthopaed Rheumatol & Musculoskelet, Ctr Stat Med, Windmill Rd, Oxford OX3 7LD, England
Funding
UK Medical Research Council
Keywords
Events per variable; Cox model; External validation; Predictive modeling; Sample size; Resampling study; LOGISTIC-REGRESSION ANALYSIS; COX REGRESSION; LIKELIHOOD; SIMULATION; NUMBER; SEPARATION; ACCURACY; BIAS;
DOI
10.1016/j.jclinepi.2016.02.031
Chinese Library Classification (CLC) number
R19 [Health care organizations and services (health services management)]
Abstract
Objectives: The choice of an adequate sample size for a Cox regression analysis is generally based on the rule of thumb derived from simulation studies of a minimum of 10 events per variable (EPV). One simulation study suggested scenarios in which the 10 EPV rule can be relaxed. The effect of a range of binary predictors with varying prevalence, reflecting clinical practice, has not yet been fully investigated.
Study Design and Setting: We conducted an extended resampling study using a large general-practice data set, comprising over 2 million anonymized patient records, to examine the EPV requirements for prediction models with low-prevalence binary predictors developed using Cox regression. The performance of the models was then evaluated using an independent external validation data set. We investigated both fully specified models and models derived using variable selection.
Results: Our results indicated that an EPV rule of thumb should be data driven and that EPV >= 20 generally eliminates bias in regression coefficients when many low-prevalence predictors are included in a Cox model.
Conclusion: Higher EPV is needed when low-prevalence predictors are present in a model to eliminate bias in regression coefficients and improve predictive accuracy. © 2016 The Authors. Published by Elsevier Inc.
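As a rough illustration of the quantity the abstract discusses, EPV is the number of outcome events divided by the number of candidate predictor parameters. The sketch below is not the authors' code: the function names and the example counts (120 events, 8 predictors) are hypothetical, and only the thresholds of 10 and 20 come from the abstract. The paper's own finding is that no single threshold is sufficient, so this arithmetic is a starting check rather than a sample size rule.

```python
import math


def events_per_variable(n_events: int, n_parameters: int) -> float:
    """EPV = outcome events / candidate predictor parameters (hypothetical helper)."""
    if n_events < 0:
        raise ValueError("n_events must be non-negative")
    if n_parameters <= 0:
        raise ValueError("n_parameters must be positive")
    return n_events / n_parameters


def min_events_required(n_parameters: int, target_epv: float) -> int:
    """Smallest number of events that reaches target_epv (hypothetical helper)."""
    if n_parameters <= 0 or target_epv <= 0:
        raise ValueError("n_parameters and target_epv must be positive")
    return math.ceil(n_parameters * target_epv)


# Hypothetical development data set: 120 events, 8 candidate predictors.
epv = events_per_variable(120, 8)
print(f"EPV = {epv:.1f}")                        # EPV = 15.0
print("meets 10 EPV rule of thumb:", epv >= 10)  # True
print("meets EPV >= 20:", epv >= 20)             # False
print("events needed for EPV >= 20:", min_events_required(8, 20))  # 160
```

In this hypothetical case the data set satisfies the conventional 10 EPV rule but falls short of the EPV >= 20 level that the abstract reports as generally removing coefficient bias when many low-prevalence predictors are included.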
Pages: 175-182
Number of pages: 8