A machine learning approach to small area estimation: predicting the health, housing and well-being of the population of Netherlands

Cited by: 25
Authors
Viljanen, Markus [1]
Meijerink, Lotta [1]
Zwakhals, Laurens [1]
van de Kassteele, Jan [1]
Affiliations
[1] Natl Inst Publ Hlth & Environm RIVM, POB 1, NL-3720 BA Bilthoven, Netherlands
Keywords
Small area estimation; Machine learning; Extreme gradient boosting; Health and welfare;
DOI
10.1186/s12942-022-00304-5
Chinese Library Classification (CLC) number
R1 [Preventive medicine, hygiene];
Discipline classification code
1004 ; 120402 ;
Abstract
Background: Local policymakers require information about public health, housing and well-being for small geographical areas. A municipality can, for example, use this information to organize targeted activities aimed at improving the well-being of its residents. Surveys are often used to gather data, but many neighborhoods have only a few respondents, or none at all. In that case, estimates of the status of the local population made directly from survey responses are likely to be unreliable.

Methods: Small Area Estimation (SAE) is a technique for providing estimates at small geographical levels with only a few or even zero respondents. In classical individual-level SAE, a complex statistical regression model is fitted to the survey responses using auxiliary administrative data for the population as predictors; the missing responses are then predicted and aggregated to the desired geographical level. In this paper we compare gradient boosted trees (XGBoost), a well-known machine learning technique, to a structured additive regression (STAR) model designed for the specific problem of estimating public health and well-being in the whole population of the Netherlands.

Results: We compare the accuracy and performance of these models using out-of-sample predictions with five-fold cross-validation (5CV). We do this for three data sets of different sample sizes and outcome types. Compared to the STAR model, gradient boosted trees improve both the accuracy of the predictions and the total time taken to obtain them. Even though the models appear quite similar in overall accuracy, the small area predictions at neighborhood level sometimes differ significantly. It may therefore make sense to pursue slightly more accurate models for better predictions in small areas. However, one of the biggest benefits is that XGBoost does not require prior knowledge or model specification. Data preparation and modelling are much easier, since the method automatically handles missing data, non-linear responses and interactions, and accounts for spatial correlation structures.

Conclusions: In this paper we provide new nationwide estimates of health, housing and well-being indicators at neighborhood level in the Netherlands; see 'Online materials'. We demonstrate that machine learning provides a good alternative to complex statistical regression modelling for small area estimation in terms of accuracy, robustness, speed and data preparation. These results can be used to make appropriate policy decisions at a local level and to make recommendations about which estimation methods are beneficial in terms of accuracy, time and budget constraints.
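
The individual-level SAE workflow described in the abstract (fit a model on survey respondents, predict the outcome for every resident from auxiliary data, then aggregate by neighborhood) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the data are synthetic, all names (`population`, `respondents`, `fit_stratum_means`, `TRUE_RATE`) are hypothetical, and a trivial stratum-mean predictor stands in for the XGBoost and STAR models so that the example runs with the standard library alone. For simplicity the sketch also predicts for every resident, including respondents.

```python
import random
from collections import defaultdict

random.seed(0)

# Synthetic "administrative register": every resident has auxiliary
# covariates and a neighborhood code (all values are made up).
NEIGHBORHOODS = ["A", "B", "C", "D", "E"]
population = [
    {
        "id": i,
        "neighborhood": random.choice(NEIGHBORHOODS),
        "age_group": random.choice(["young", "middle", "old"]),
        "sex": random.choice(["f", "m"]),
    }
    for i in range(1000)
]

# Synthetic survey: a sample of residents with an observed binary outcome.
# Neighborhood "E" is excluded so that it has zero respondents.
TRUE_RATE = {"young": 0.2, "middle": 0.4, "old": 0.6}  # assumed, illustrative
candidates = [p for p in population if p["neighborhood"] != "E"]
respondents = [
    dict(p, outcome=int(random.random() < TRUE_RATE[p["age_group"]]))
    for p in random.sample(candidates, 100)
]


def fit_stratum_means(sample):
    """Stand-in model: mean outcome per (age_group, sex) stratum."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for person in sample:
        key = (person["age_group"], person["sex"])
        sums[key] += person["outcome"]
        counts[key] += 1
    overall = sum(sums.values()) / sum(counts.values())
    means = {key: sums[key] / counts[key] for key in sums}
    return means, overall


def predict(model, person):
    """Predicted probability; falls back to the overall mean for unseen strata."""
    means, overall = model
    return means.get((person["age_group"], person["sex"]), overall)


# Step 1: fit the model on survey respondents only.
model = fit_stratum_means(respondents)

# Steps 2 and 3: predict for every resident, then average per neighborhood.
totals = defaultdict(float)
sizes = defaultdict(int)
for person in population:
    totals[person["neighborhood"]] += predict(model, person)
    sizes[person["neighborhood"]] += 1
estimates = {hood: totals[hood] / sizes[hood] for hood in totals}

for hood in sorted(estimates):
    print(hood, round(estimates[hood], 3))
```

Neighborhood "E" still receives an estimate despite having no respondents, because the prediction depends only on covariates that are known for the whole population. Replacing `fit_stratum_means` and `predict` with a gradient boosted tree model would leave the aggregation step unchanged.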
Pages: 18