CATEGORICAL DATA FUSION USING AUXILIARY INFORMATION

Cited by: 16
Authors
Fosdick, Bailey K. [1 ]
DeYoreo, Maria [2 ,3 ]
Reiter, Jerome P. [2 ,3 ]
Affiliations
[1] Colorado State Univ, Dept Stat, 102 Stat Bldg, Ft Collins, CO 80523 USA
[2] Duke Univ, Dept Stat Sci, Box 90251, Durham, NC 27706 USA
[3] Duke Univ, Dept Stat Sci, BOX 90251, Durham, NC 27708 USA
Funding
National Science Foundation (USA);
Keywords
Imputation; integration; latent class; matching; MULTIPLE IMPUTATIONS; FILE CONCATENATION; ADJUSTED WEIGHTS; DIRICHLET; MODELS; PRIORS;
DOI
10.1214/16-AOAS925
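Chinese Library Classification number
O21 [Probability theory and mathematical statistics]; C8 [Statistics];
Discipline classification code
020208 ; 070103 ; 0714 ;
Abstract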
In data fusion, analysts seek to combine information from two databases comprised of disjoint sets of individuals, in which some variables appear in both databases and other variables appear in only one database. Most data fusion techniques rely on variants of conditional independence assumptions. When inappropriate, these assumptions can result in unreliable inferences. We propose a data fusion technique that allows analysts to easily incorporate auxiliary information on the dependence structure of variables not observed jointly; we refer to this auxiliary information as glue. With this technique, we fuse two marketing surveys from the book publisher HarperCollins using glue from the online, rapid-response polling company CivicScience. The fused data enable estimation of associations between people's preferences for authors and for learning about new books. The analysis also serves as a case study on the potential for using online surveys to aid data fusion.
Pages: 1907-1929
Page count: 23
Related papers
31 in total
[1]  
[Anonymous], STAT MATCHING THEORY
[2]  
D'ORAZIO M., 2002, RIV STAT UFFICIALE, V1, P5
[3]   Nonparametric Bayes Modeling of Multivariate Categorical Data [J].
Dunson, David B. ;
Xing, Chuanhua .
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 2009, 104 (487) :1042-1051
[4]  
FOSDICK B., 2016, CATEGORICAL DATA F S, DOI 10.1214/16-AOAS925SUPP
[5]   On choosing and bounding probability metrics [J].
Gibbs, AL ;
Su, FE .
INTERNATIONAL STATISTICAL REVIEW, 2002, 70 (03) :419-435
[6]   A direct approach to data fusion [J].
Gilula, Z ;
McCulloch, RE ;
Rossi, PE .
JOURNAL OF MARKETING RESEARCH, 2006, 43 (01) :73-83
[7]   Multi level categorical data fusion using partially fused data [J].
Gilula, Zvi ;
McCulloch, Robert .
QME-QUANTITATIVE MARKETING AND ECONOMICS, 2013, 11 (03) :353-377
[8]  
GOODMAN LA, 1974, BIOMETRIKA, V61, P215, DOI 10.1093/biomet/61.2.215
[9]   Markov chain Monte Carlo in approximate Dirichlet and beta two-parameter process hierarchical models [J].
Ishwaran, H ;
Zarepour, M .
BIOMETRIKA, 2000, 87 (02) :371-390
[10]   Gibbs sampling methods for stick-breaking priors [J].
Ishwaran, H ;
James, LF .
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 2001, 96 (453) :161-173