A comparison of statistical and machine learning methods for creating national daily maps of ambient PM2.5 concentration

Cited by: 68
Authors
Berrocal, Veronica J. [1]
Guan, Yawen [2]
Muyskens, Amanda [3]
Wang, Haoyu [4]
Reich, Brian J. [4]
Mulholland, James A. [5]
Chang, Howard H. [6]
Affiliations
[1] Univ Calif Irvine, Dept Stat, Irvine, CA 92697 USA
[2] Univ Nebraska, Dept Stat, Lincoln, NE USA
[3] Lawrence Livermore Natl Lab, Livermore, CA 94550 USA
[4] SAS, Cary, NC USA
[5] Georgia Inst Technol, Atlanta, GA 30332 USA
[6] Emory Univ, Dept Biostat & Bioinformat, Atlanta, GA 30322 USA
Funding
National Science Foundation (USA); National Institutes of Health (USA);
Keywords
LAND-USE REGRESSION; AIR-POLLUTION; PRETERM BIRTH; US STATE; OZONE; POLLUTANTS; MODEL; NOX;
DOI
10.1016/j.atmosenv.2019.117130
Chinese Library Classification number
X [Environmental Science, Safety Science];
Discipline classification code
08 ; 0830 ;
Abstract
A typical challenge in air pollution epidemiology is to perform detailed exposure assessment for the individuals for whom health data are available. To address this problem, substantial research effort has been devoted in the last few years to developing statistical methods and machine learning techniques that generate estimates of air pollution at fine spatial and temporal scales (usually daily) with complete coverage. However, it is not clear how much the predicted exposures yielded by the various methods differ, or which method generates more reliable estimates. In this paper, we address this gap by evaluating a variety of exposure modeling approaches and comparing their predictive performance. Using PM2.5 in 2011 over the continental U.S. as a case study, we generate national maps of ambient PM2.5 concentration using: (i) ordinary least squares and inverse distance weighting; (ii) kriging; (iii) statistical downscaling models, that is, spatial statistical models that use the information contained in air quality model outputs; (iv) land use regression, that is, linear regression modeling approaches that leverage the information in Geographical Information System (GIS) covariates; and (v) machine learning methods, such as neural networks, random forests, and support vector regression. We examine the methods' predictive performance via cross-validation using Root Mean Squared Error, Mean Absolute Deviation, Pearson correlation, and Mean Spatial Pearson Correlation. Additionally, we evaluate whether factors such as season, urbanicity, and level of PM2.5 concentration (low, medium, or high) affect the performance of the different methods. Overall, statistical methods that explicitly model spatial correlation, e.g., universal kriging and the downscaler model, outperform all the other exposure assessment approaches regardless of season, urbanicity, and PM2.5 concentration level.
We posit that spatial statistical models predict better than machine learning methods because they explicitly account for spatial dependence, thus borrowing information from neighboring observations. In light of our findings, we suggest that future exposure assessment methods for regional PM2.5 incorporate information from neighboring sites when deriving predictions at unsampled locations, or otherwise attempt to account for spatial dependence.
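As an illustration only, the following minimal sketch shows how one of the simplest methods named in the abstract, inverse distance weighting, could be scored by cross-validation with three of the listed metrics. It is not the authors' code: the leave-one-out scheme, the distance power of 2, the planar coordinates, the reading of Mean Absolute Deviation as the mean of |observed - predicted|, and the toy monitor data are all assumptions made for this example.

```python
import math

def idw_predict(sites, values, target, power=2.0):
    """Inverse distance weighting: weighted mean with weights 1 / distance**power."""
    num = 0.0
    den = 0.0
    for (x, y), v in zip(sites, values):
        d = math.hypot(x - target[0], y - target[1])
        if d == 0.0:
            return v  # target coincides with a monitor: return its value
        w = 1.0 / d ** power
        num += w * v
        den += w
    return num / den

def loo_predictions(sites, values, power=2.0):
    """Leave-one-out cross-validation: predict each site from all the others."""
    preds = []
    for i in range(len(sites)):
        rest_sites = sites[:i] + sites[i + 1:]
        rest_values = values[:i] + values[i + 1:]
        preds.append(idw_predict(rest_sites, rest_values, sites[i], power))
    return preds

def rmse(obs, pred):
    """Root Mean Squared Error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

def mad(obs, pred):
    """Mean Absolute Deviation, taken here as the mean of |obs - pred| (assumption)."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)

def pearson(obs, pred):
    """Pearson correlation between observations and predictions."""
    n = len(obs)
    mo = sum(obs) / n
    mp = sum(pred) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    so = math.sqrt(sum((o - mo) ** 2 for o in obs))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    return cov / (so * sp)

# Hypothetical monitor locations (arbitrary planar units) and PM2.5 values.
toy_sites = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
toy_pm25 = [8.0, 9.5, 7.0, 10.0, 12.5]

toy_preds = loo_predictions(toy_sites, toy_pm25)
print("RMSE:", rmse(toy_pm25, toy_preds))
print("MAD:", mad(toy_pm25, toy_preds))
print("Pearson:", pearson(toy_pm25, toy_preds))
```

Inverse distance weighting uses neighboring observations but does not estimate the spatial correlation from the data, whereas kriging and the downscaler model do; this is the distinction the abstract draws when it credits explicit modeling of spatial dependence.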
Pages: 14
Related papers
44 in total
[1]   Spatiotemporal Modeling of Ozone Levels in Quebec (Canada): A Comparison of Kriging, Land-Use Regression (LUR), and Combined Bayesian Maximum Entropy-LUR Approaches [J].
Adam-Poupart, Ariane ;
Brand, Allan ;
Fournier, Michel ;
Jerrett, Michael ;
Smargiassi, Audrey .
ENVIRONMENTAL HEALTH PERSPECTIVES, 2014, 122 (09) :970-976
[2]   Consequences of kriging and land use regression for PM2.5 predictions in epidemiologic analyses: insights into spatial variability using high-resolution satellite data [J].
Alexeeff, Stacey E. ;
Schwartz, Joel ;
Kloog, Itai ;
Chudnovsky, Alexandra ;
Koutrakis, Petros ;
Coull, Brent A. .
JOURNAL OF EXPOSURE SCIENCE AND ENVIRONMENTAL EPIDEMIOLOGY, 2015, 25 (02) :138-144
[3]  
[Anonymous], INTRO STAT LEARNIN
[4]  
[Anonymous], 1986, FDN PARALLEL DISTRIB
[5]  
[Anonymous], 2016, DEEP LEARNING
[6]  
[Anonymous], 2005, McGraw-Hill international edition
[7]  
[Anonymous], 2015, Tech. Rep.
[8]   A Spatio-Temporal Downscaler for Output From Numerical Models [J].
Berrocal, Veronica J. ;
Gelfand, Alan E. ;
Holland, David M. .
JOURNAL OF AGRICULTURAL BIOLOGICAL AND ENVIRONMENTAL STATISTICS, 2010, 15 (02) :176-197
[9]   Random forests [J].
Breiman, L .
MACHINE LEARNING, 2001, 45 (01) :5-32
[10]  
Breiman L., 2001, IEEE Trans. Broadcast., V45, P5