Does data splitting improve prediction?

Cited by: 18
Authors
Faraway, Julian J. [1]
Affiliations
[1] Univ Bath, Dept Math Sci, Bath BA2 7AY, Avon, England
Keywords
Cross-validation; Model assessment; Model uncertainty; Model validation; Prediction; Scoring; MODEL SELECTION; VALIDATION; ERROR;
DOI
10.1007/s11222-014-9522-9
Chinese Library Classification (CLC) number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Data splitting divides data into two parts. One part is reserved for model selection. In some applications, the second part is used for model validation but we use this part for estimating the parameters of the chosen model. We focus on the problem of constructing reliable predictive distributions for future observed values. We judge the predictive performance using log scoring. We compare the full data strategy with the data splitting strategy for prediction. We show how the full data score can be decomposed into model selection, parameter estimation and data reuse costs. Data splitting is preferred when data reuse costs are high. We investigate the relative performance of the strategies in four simulation scenarios. We introduce a hybrid estimator that uses one part for model selection but both parts for estimation. We argue that a split data analysis is preferred to a full data analysis for prediction with some exceptions.
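The three strategies named in the abstract (full data, split data, and the hybrid) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the candidate models are polynomial regressions, selection uses BIC, the predictive distribution is a plug-in normal, and the function names (`select_degree`, `compare_strategies`, and so on) are hypothetical. Under the sign convention used below, a higher mean log score is better.

```python
import numpy as np

def design(x, degree):
    # Polynomial design matrix with columns 1, x, ..., x**degree.
    return np.vander(x, degree + 1, increasing=True)

def fit(x, y, degree):
    # Least-squares fit; returns coefficients and the ML error-variance estimate.
    X = design(x, degree)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = max(float(resid @ resid) / len(y), 1e-12)
    return beta, sigma2

def select_degree(x, y, max_degree=4):
    # Assumed selection rule: choose the polynomial degree with the smallest BIC.
    n = len(y)

    def bic(d):
        _, s2 = fit(x, y, d)
        return n * np.log(s2) + (d + 2) * np.log(n)

    return min(range(max_degree + 1), key=bic)

def log_score(x_new, y_new, beta, sigma2):
    # Mean log density of new observations under a plug-in normal
    # predictive distribution (a simplification made for this sketch).
    mu = design(x_new, len(beta) - 1) @ beta
    dens = -0.5 * np.log(2 * np.pi * sigma2) - (y_new - mu) ** 2 / (2 * sigma2)
    return float(np.mean(dens))

def compare_strategies(x, y, x_new, y_new, rng, frac=0.5):
    # Randomly split the data: one part for selection, the other for estimation.
    idx = rng.permutation(len(y))
    k = int(frac * len(y))
    sel, est = idx[:k], idx[k:]
    d_full = select_degree(x, y)             # select on all data
    d_split = select_degree(x[sel], y[sel])  # select on the first part only
    return {
        # Full data: select and estimate on everything.
        "full": log_score(x_new, y_new, *fit(x, y, d_full)),
        # Split: select on part one, estimate on part two.
        "split": log_score(x_new, y_new, *fit(x[est], y[est], d_split)),
        # Hybrid: select on part one, estimate on both parts.
        "hybrid": log_score(x_new, y_new, *fit(x, y, d_split)),
    }

# Toy data from a linear model (illustrative only, not one of the paper's scenarios).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 60)
y = 1 + 2 * x + rng.normal(0, 0.5, 60)
x_new = rng.uniform(-1, 1, 2000)
y_new = 1 + 2 * x_new + rng.normal(0, 0.5, 2000)

scores = compare_strategies(x, y, x_new, y_new, rng)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

A single run like this says little about which strategy is better; a meaningful comparison would average the scores over many simulated data sets and vary the data-generating model.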
Pages: 49-60
Page count: 12
Related papers
36 in total
  • [11] Data splitting as a countermeasure against hypothesis fishing: with a case study of predictors for low back pain
    Dahl, Fredrik A.
    Grotle, Margreth
    Benth, Jurate Saltyte
    Natvig, Bard
    [J]. EUROPEAN JOURNAL OF EPIDEMIOLOGY, 2008, 23 (04) : 237 - 242
  • [12] STATISTICAL-THEORY - THE PREQUENTIAL APPROACH
    DAWID, AP
    [J]. JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES A-STATISTICS IN SOCIETY, 1984, 147 : 278 - 292
  • [14] Faraway J., 1992, J COMPUT GRAPH STAT, V1, P215
  • [15] Strictly proper scoring rules, prediction, and estimation
    Gneiting, Tilmann
    Raftery, Adrian E.
    [J]. JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 2007, 102 (477) : 359 - 378
  • [16] GOOD IJ, 1952, J ROY STAT SOC B, V14, P107
  • [17] Split Samples and Design Sensitivity in Observational Studies
    Heller, Ruth
    Rosenbaum, Paul R.
    Small, Dylan S.
    [J]. JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 2009, 104 (487) : 1090 - 1101
  • [18] THE ANALYSIS OF TRANSFORMED DATA
    HINKLEY, DV
    RUNGER, G
    [J]. JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 1984, 79 (386) : 302 - 309
  • [19] HIRSCH RP, 1991, BIOMETRICS, V47, P1193
  • [20] Frequentist prediction intervals and predictive distributions
    Lawless, JF
    Fredette, M
    [J]. BIOMETRIKA, 2005, 92 (03) : 529 - 542