Minimum sample size for developing a multivariable prediction model: Part I - Continuous outcomes

Cited by: 170
Authors
Riley, Richard D. [1 ]
Snell, Kym I. E. [1 ]
Ensor, Joie [1 ]
Burke, Danielle L. [1 ]
Harrell, Frank E., Jr. [2 ]
Moons, Karel G. M. [3 ]
Collins, Gary S. [4 ]
Affiliations
[1] Keele Univ, Res Inst Primary Care & Hlth Sci, Ctr Prognosis Res, Keele ST5 5BG, Staffs, England
[2] Vanderbilt Univ, Sch Med, Dept Biostat, Nashville, TN 37212 USA
[3] Univ Med Ctr Utrecht, Julius Ctr Hlth Sci & Primary Care, Utrecht, Netherlands
[4] Univ Oxford, Nuffield Dept Orthopaed Rheumatol & Musculoskelet, Ctr Stat Med, Oxford, England
Keywords
continuous outcome; linear regression; minimum sample size; multivariable prediction model; R-squared; CONFIDENCE-INTERVALS; LINEAR-REGRESSION; STATISTICAL POWER; LIKELIHOOD; SHRINKAGE; ACCURACY;
DOI
10.1002/sim.7993
Chinese Library Classification number
Q [Biological Sciences];
Subject classification codes
07 ; 0710 ; 09 ;
Abstract
In the medical literature, hundreds of prediction models are being developed to predict health outcomes in individuals. For continuous outcomes, typically a linear regression model is developed to predict an individual's outcome value conditional on values of multiple predictors (covariates). To improve model development and reduce the potential for overfitting, a suitable sample size is required in terms of the number of subjects (n) relative to the number of predictor parameters (p) for potential inclusion. We propose that the minimum value of n should meet the following four key criteria: (i) small optimism in predictor effect estimates as defined by a global shrinkage factor of ≥ 0.9; (ii) small absolute difference of ≤ 0.05 in the apparent and adjusted R²; (iii) precise estimation (a margin of error ≤ 10% of the true value) of the model's residual standard deviation; and similarly, (iv) precise estimation of the mean predicted outcome value (model intercept). The criteria require prespecification of the user's chosen p and the model's anticipated R² as informed by previous studies. The value of n that meets all four criteria provides the minimum sample size required for model development. In an applied example, a new model to predict lung function in African-American women using 25 predictor parameters requires at least 918 subjects to meet all criteria, corresponding to at least 36.7 subjects per predictor parameter. Even larger sample sizes may be needed to additionally ensure precise estimates of key predictor effects, especially when important categorical predictors have low prevalence in certain categories.
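As a rough illustration of criterion (ii), the sketch below uses the textbook definition of adjusted R², R²_adj = 1 - (1 - R²_app)(n - 1)/(n - p - 1), which implies R²_app - R²_adj = (1 - R²_adj) p / (n - 1). Whether the paper's criterion takes exactly this form is an assumption, as are the function names and the anticipated R² of 0.2, which does not appear in the abstract and is chosen only for illustration. The sketch covers criterion (ii) alone, so its result is a lower bound that the other three criteria may exceed.

```python
def optimism_in_r2(n, p, r2_adj):
    """Apparent minus adjusted R^2 implied by the textbook adjustment formula."""
    return (1.0 - r2_adj) * p / (n - 1)


def min_n_for_r2_optimism(p, r2_adj, max_diff=0.05):
    """Smallest n whose apparent-vs-adjusted R^2 gap is at most max_diff.

    Hypothetical helper: addresses criterion (ii) only, not the shrinkage,
    residual SD, or intercept criteria.
    """
    n = p + 2  # smallest n that leaves one residual degree of freedom
    # The small tolerance guards against floating-point noise at the boundary.
    while optimism_in_r2(n, p, r2_adj) > max_diff + 1e-12:
        n += 1
    return n


# Assumed inputs: p = 25 is from the abstract; r2_adj = 0.2 is illustrative.
n_criterion_ii = min_n_for_r2_optimism(p=25, r2_adj=0.2)
print(n_criterion_ii)  # 401

# Arithmetic check of the abstract's subjects-per-parameter figure.
subjects_per_parameter = 918 / 25
print(subjects_per_parameter)  # 36.72, reported as 36.7 in the abstract
```

Under these assumed inputs, criterion (ii) alone would be met well below the 918 subjects quoted in the abstract, which is consistent with the abstract's statement that the minimum n is the value satisfying all four criteria at once.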
Pages: 1262-1275
Number of pages: 14