Prediction and model evaluation for space-time data

Cited by: 3
Authors
Watson, G. L. [1]
Reid, C. E. [2]
Jerrett, M. [3]
Telesca, D. [1, 4]
Affiliations
[1] Univ Calif Los Angeles, Dept Biostat, Los Angeles, CA USA
[2] Univ Colorado, Dept Geog, Boulder, CO USA
[3] Univ Calif Los Angeles, Dept Environm Hlth Sci, Los Angeles, CA USA
[4] UCLA Fielding Sch Publ Hlth, Box 177220,Suite 51-254 CHS, Los Angeles, CA 90095 USA
Keywords
Cross validation; generalization error; machine learning; point process; space-time data; FINE PARTICULATE MATTER; CROSS-VALIDATION; SPATIOTEMPORAL PREDICTION; PM2.5; CONCENTRATIONS; WEIGHTED REGRESSION; MEASUREMENT ERROR; SELECTION
DOI
10.1080/02664763.2023.2252208
Chinese Library Classification (CLC)
O21 [Probability and Mathematical Statistics]; C8 [Statistics]
Subject classification codes
020208; 070103; 0714
Abstract
Evaluation metrics for prediction error, model selection and model averaging on space-time data are understudied and poorly understood. The absence of independent replication makes prediction ambiguous as a concept and renders evaluation procedures developed for independent data inappropriate for most space-time prediction problems. Motivated by air pollution data collected during California wildfires in 2008, this manuscript attempts a formalization of the true prediction error associated with spatial interpolation. We investigate a variety of cross-validation (CV) procedures employing both simulations and case studies to provide insight into the nature of the estimand targeted by alternative data partition strategies. Consistent with recent best practice, we find that location-based cross-validation is appropriate for estimating spatial interpolation error as in our analysis of the California wildfire data. Interestingly, commonly held notions of bias-variance trade-off of CV fold size do not trivially apply to dependent data, and we recommend leave-one-location-out (LOLO) CV as the preferred prediction error metric for spatial interpolation.
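The leave-one-location-out (LOLO) partition named in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names (`lolo_splits`, `lolo_rmse`, `mean_predictor`), the toy monitor labels and the toy values are all hypothetical, and the mean predictor is only a stand-in for a real spatial interpolation model. The point shown is that every observation from one location is held out together, so the model is always scored at a location it has never seen.

```python
from collections import defaultdict
from math import sqrt


def lolo_splits(locations):
    """Yield (held_out_location, train_idx, test_idx), one fold per unique location."""
    by_loc = defaultdict(list)
    for i, loc in enumerate(locations):
        by_loc[loc].append(i)
    for loc, test_idx in by_loc.items():
        held_out = set(test_idx)
        # All rows from the held-out location leave the training set together.
        train_idx = [i for i in range(len(locations)) if i not in held_out]
        yield loc, train_idx, test_idx


def lolo_rmse(locations, y, fit_predict):
    """Pool squared errors over all held-out locations and return the RMSE."""
    sq_err = []
    for _, train_idx, test_idx in lolo_splits(locations):
        preds = fit_predict(train_idx, test_idx)
        sq_err.extend((y[i] - p) ** 2 for i, p in zip(test_idx, preds))
    return sqrt(sum(sq_err) / len(sq_err))


# Toy data (made up): 3 monitors, 2 time points each.
locations = ["A", "A", "B", "B", "C", "C"]
y = [1.0, 3.0, 2.0, 4.0, 6.0, 8.0]


def mean_predictor(train_idx, test_idx):
    """Placeholder model: predict the training mean at every held-out row."""
    m = sum(y[i] for i in train_idx) / len(train_idx)
    return [m] * len(test_idx)


rmse = lolo_rmse(locations, y, mean_predictor)
print(f"LOLO RMSE: {rmse:.4f}")
```

A random row-wise split would instead leave other time points from the same monitor in the training set, which targets a different estimand than interpolation to an unobserved location.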
Pages: 2007-2024
Page count: 18