Leveraging Predictive Modelling from Multiple Sources of Big Data to Improve Sample Efficiency and Reduce Survey Nonresponse Error

Cited by: 1
Authors
Dutwin, David [1, 3]
Coyle, Patrick [2 ]
Bilgen, Ipek [1 ]
English, Ned [2 ]
Affiliations
[1] Univ Chicago, AmeriSpeak, NORC, Chicago, IL USA
[2] Univ Chicago, NORC, Chicago, IL USA
[3] Univ Chicago, NORC, 55 E Monroe St, Chicago, IL 60603 USA
Keywords
Big data; Machine learning; Sampling
DOI
10.1093/jssam/smad016
Chinese Library Classification (CLC) number
O1 [Mathematics]; C [Social Sciences, General]
Subject classification code
03; 0303; 0701; 070101
Abstract
Big data has been fruitfully leveraged as a supplement for survey data, sometimes as its replacement, and in the best of worlds as a "force multiplier" to improve survey analytics and insight. We detail a use case, the big data classifier (BDC), as a replacement for the more traditional methods of targeting households in survey sampling on specific household and personal attributes. Much like geographic targeting and the use of commercial vendor flags, BDCs predict the likelihood that any given household is, for example, one that contains a child or someone who is Hispanic. We build 15 BDCs with the combined data from a large nationally representative probability-based panel and a range of big data from public and private sources, and then assess how effectively these BDCs predict their target attributes across three large survey datasets. For each BDC and each data application, we compare the effectiveness of the BDCs against the historical sample targeting techniques of geographic clustering and vendor flags. Overall, BDCs offer a modest improvement in their ability to target subpopulations. We find classes of predictions that are consistently more effective, and others where the BDCs are on par with vendor flagging, though always superior to geographic clustering. We present some of the relative strengths and weaknesses of BDCs as a new method to identify and subsequently sample low-incidence and other populations.
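The comparison described in the abstract (a classifier score versus a vendor flag as a way to target a subpopulation) can be illustrated with a small sketch. This is not the authors' code, data, or evaluation metric: the function names, the toy scores, and the choice of "incidence among the top-ranked share of the frame" as the yardstick are all assumptions made for illustration.

```python
# Hypothetical illustration: how much does ranking households by a
# targeting signal raise the share of eligible households in the sample?
# All names and numbers below are invented for this sketch.


def incidence_in_top_fraction(scores, labels, fraction):
    """Share of truly eligible households among the top-scoring `fraction`."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not scores or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and equal length")
    n_selected = max(1, round(len(scores) * fraction))
    # Rank households from highest to lowest score; ties keep input order.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    selected = ranked[:n_selected]
    return sum(label for _, label in selected) / n_selected


def targeting_lift(scores, labels, fraction):
    """Incidence in the targeted group divided by incidence in the full frame."""
    base_rate = sum(labels) / len(labels)
    if base_rate == 0:
        raise ValueError("no eligible households in the frame")
    return incidence_in_top_fraction(scores, labels, fraction) / base_rate


# Toy frame of eight households: 1 = household has the target attribute.
labels = [1, 1, 0, 1, 0, 0, 0, 0]
# Assumed classifier probabilities (stand-in for a BDC score).
bdc_scores = [0.90, 0.80, 0.70, 0.40, 0.30, 0.20, 0.10, 0.05]
# Assumed binary vendor flag, treated as a 0/1 score.
vendor_flag = [1, 0, 1, 0, 0, 1, 0, 0]

bdc_lift = targeting_lift(bdc_scores, labels, 0.25)
flag_lift = targeting_lift(vendor_flag, labels, 0.25)
print(f"BDC lift: {bdc_lift:.2f}, vendor-flag lift: {flag_lift:.2f}")
```

A lift above 1 means the targeted group contains the attribute more often than the frame as a whole. With a binary flag, many households tie, so the ranking inside the flagged group is arbitrary; a continuous score avoids that.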
Pages: 435-457
Page count: 23