Comparing Inference Methods for Non-probability Samples

Times cited: 42
Authors
Buelens, Bart [1]
Burger, Joep [1]
van den Brakel, Jan A. [1, 2]
Affiliations
[1] Stat Netherlands, POB 4481, NL-6401 CZ Heerlen, Netherlands
[2] Maastricht Univ, Sch Business & Econ, POB 616, NL-6200 MD Maastricht, Netherlands
Keywords
Algorithmic inference; big data; predictive modelling; pseudo-design-based estimation; DESIGN-BASED ANALYSIS; BIG DATA; OFFICIAL STATISTICS; PROPENSITY SCORE; WEB SURVEYS; ESTIMATORS
DOI
10.1111/insr.12253
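Chinese Library Classification (CLC)
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Subject classification codes
020208; 070103; 0714
Abstract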
Social and economic scientists are tempted to use emerging data sources like big data to compile information about finite populations as an alternative for traditional survey samples. These data sources generally cover an unknown part of the population of interest. Simply assuming that analyses made on these data are applicable to larger populations is wrong. The mere volume of data provides no guarantee for valid inference. Tackling this problem with methods originally developed for probability sampling is possible but shown here to be limited. A wider range of model-based predictive inference methods proposed in the literature are reviewed and evaluated in a simulation study using real-world data on annual mileages by vehicles. We propose to extend this predictive inference framework with machine learning methods for inference from samples that are generated through mechanisms other than random sampling from a target population. Describing economies and societies using sensor data, internet search data, social media and voluntary opt-in panels is cost-effective and timely compared with traditional surveys but requires an extended inference framework as proposed in this article.
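The abstract's point that sheer data volume does not guarantee valid inference can be illustrated with a minimal sketch. It is not the authors' method or data: the stratum names, register counts and mileages below are invented for illustration, and post-stratification is used only as one simple example of a pseudo-design-based correction. The correction works here only because selection is assumed to be unrelated to mileage within each stratum, which is the kind of assumption the abstract describes as limiting.

```python
from statistics import mean

# Assumed known population counts per stratum (hypothetical register totals).
population_counts = {"new": 4000, "old": 6000}

# Hypothetical non-probability sample of (stratum, annual mileage in km).
# New vehicles are heavily over-represented relative to the population.
sample = (
    [("new", km) for km in (18000, 20000, 22000, 19000, 21000, 20000, 18500, 21500)]
    + [("old", km) for km in (9000, 11000)]
)


def naive_mean(sample):
    """Unweighted sample mean, ignoring how units entered the sample."""
    return mean(km for _, km in sample)


def poststratified_mean(sample, population_counts):
    """Weight each stratum's sample mean by its known population count."""
    total = sum(population_counts.values())
    weighted_sum = 0.0
    for stratum, count in population_counts.items():
        values = [km for s, km in sample if s == stratum]
        if not values:
            # An empty stratum cannot be corrected by reweighting alone.
            raise ValueError(f"no sample units in stratum {stratum!r}")
        weighted_sum += count * mean(values)
    return weighted_sum / total


print(naive_mean(sample))                              # 18000
print(poststratified_mean(sample, population_counts))  # 14000.0
```

With these invented numbers the unweighted mean is pulled toward the over-represented new vehicles, while the post-stratified estimate reweights the two stratum means to the assumed population shares. If a stratum were missing from the sample altogether, reweighting could not help and a predictive model would be needed instead.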
Pages: 322-343
Number of pages: 22