Random Forests for Spatially Dependent Data

Cited by: 47
Authors
Saha, Arkajyoti [1]
Basu, Sumanta [2]
Datta, Abhirup [1]
Affiliations
[1] Johns Hopkins Univ, Dept Biostat, Baltimore, MD 21205 USA
[2] Cornell Univ, Dept Stat & Data Sci, Ithaca, NY USA
Keywords
Gaussian processes; Generalized least squares; Random forests; Spatial; Gaussian process models; Uniform laws; Uncertainty
DOI
10.1080/01621459.2021.1950003
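Chinese Library Classification (CLC) number
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics]
Subject classification codes
020208; 070103; 0714
Abstract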
Spatial linear mixed models, consisting of a linear covariate effect and a Gaussian process (GP) distributed spatial random effect, are widely used for analyses of geospatial data. We consider the setting where the covariate effect is nonlinear. Random forests (RF) are popular for estimating nonlinear functions, but applications of RF to spatial data have often ignored the spatial correlation. We show that this adversely impacts the performance of RF. We propose RF-GLS, a novel and well-principled extension of RF, for estimating nonlinear covariate effects in spatial mixed models where the spatial correlation is modeled using GP. RF-GLS extends RF in the same way generalized least squares (GLS) fundamentally extends ordinary least squares (OLS) to accommodate dependence in linear models. RF becomes a special case of RF-GLS, and is substantially outperformed by RF-GLS for both estimation and prediction across extensive numerical experiments with spatially correlated data. RF-GLS can be used for functional estimation in other types of dependent data, such as time series. We prove consistency of RF-GLS for beta-mixing dependent error processes that include the popular spatial Matérn GP. As a byproduct, we also establish, to our knowledge, the first consistency result for RF under dependence. We establish results of independent importance, including a general consistency result of GLS optimizers of data-driven function classes, and a uniform law of large numbers under beta-mixing dependence with weaker assumptions. These new tools are potentially useful for asymptotic analysis of other GLS-style estimators in nonparametric regression with dependent data.
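The OLS-to-GLS analogy in the abstract can be illustrated in the linear case it refers to. The following is a minimal NumPy sketch, not the RF-GLS algorithm itself: the exponential covariance, its parameter values, the simulated data, and all function names are assumptions made for illustration. It shows that GLS is OLS applied to data whitened by a Cholesky factor of the error covariance, and that GLS reduces to OLS when the covariance is the identity.

```python
import numpy as np

def exponential_cov(coords, sigma2=1.0, phi=3.0, nugget=0.1):
    """Hypothetical exponential spatial covariance plus a nugget term."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return sigma2 * np.exp(-phi * d) + nugget * np.eye(len(coords))

def ols(X, y):
    """Ordinary least squares: minimizes ||y - X b||^2."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def gls(X, y, Sigma):
    """GLS via whitening: with Sigma = L L^T, run OLS on L^{-1} X and L^{-1} y."""
    L = np.linalg.cholesky(Sigma)
    Xw = np.linalg.solve(L, X)
    yw = np.linalg.solve(L, y)
    return np.linalg.lstsq(Xw, yw, rcond=None)[0]

def gls_direct(X, y, Sigma):
    """GLS via the closed form (X^T Sigma^{-1} X)^{-1} X^T Sigma^{-1} y."""
    Si_X = np.linalg.solve(Sigma, X)  # Sigma^{-1} X
    return np.linalg.solve(X.T @ Si_X, Si_X.T @ y)

# Simulated example: linear mean plus spatially correlated errors.
rng = np.random.default_rng(0)
n = 200
coords = rng.uniform(size=(n, 2))            # locations in the unit square
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0])
Sigma = exponential_cov(coords)
errors = np.linalg.cholesky(Sigma) @ rng.normal(size=n)
y = X @ beta_true + errors

beta_ols = ols(X, y)                         # ignores spatial correlation
beta_gls = gls(X, y, Sigma)                  # accounts for it
beta_gls_identity = gls(X, y, np.eye(n))     # identity covariance: same as OLS
```

With an identity covariance the whitening step does nothing, so the GLS fit coincides with the OLS fit. This mirrors, in the linear setting only, the abstract's statement that RF is a special case of RF-GLS.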
Pages: 665-683
Page count: 19