Sufficiency Revisited: Rethinking Statistical Algorithms in the Big Data Era

Cited by: 15

Authors:

Lee, Jarod Y. L. [1,2]

Brown, James J. [1,2]

Ryan, Louise M. [1,2,3]

Affiliations:

[1] Univ Technol Sydney, Sch Math & Phys Sci, Ultimo, NSW 2007, Australia

[2] Univ Melbourne, Australian Res Council, Ctr Excellence Math & Stat Frontiers, Parkville, Vic 3010, Australia

[3] Harvard TH Chan Sch Publ Hlth, Dept Biostat, Boston, MA USA

Source:

AMERICAN STATISTICIAN | 2017, Vol. 71, No. 3

Keywords:

Big data; Distributed database; Divide and recombine; Generalized linear mixed model; Multilevel model; Privacy; MONTE-CARLO; COMPRESSION;

DOI:

10.1080/00031305.2016.1255659

Chinese Library Classification (CLC):

O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics];

Discipline classification codes:

020208 ; 070103 ; 0714 ;

Abstract:

The big data era demands new statistical analysis paradigms, since traditional methods often break down when datasets are too large to fit on a single desktop computer. Divide and Recombine (D&R) is becoming a popular approach for big data analysis, where results are combined over subanalyses performed in separate data subsets. In this article, we consider situations where unit record data cannot be made available by data custodians due to privacy concerns, and explore the concept of statistical sufficiency and summary statistics for model fitting. The resulting approach represents a type of D&R strategy, which we refer to as summary statistics D&R, as opposed to the standard approach, which we refer to as horizontal D&R. We demonstrate the concept via an extended Gamma-Poisson model, where summary statistics are extracted from different databases and incorporated directly into the fitting algorithm without having to combine unit record data. By exploiting the natural hierarchy of data, our approach has major benefits in terms of privacy protection. Incorporating the proposed modelling framework into data extraction tools such as TableBuilder by the Australian Bureau of Statistics allows for potential analysis at a finer geographical level, which we illustrate with a multilevel analysis of Australian unemployment data. Supplementary materials for this article are available online.
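
The idea of fitting from summary statistics rather than unit records can be illustrated with a minimal sketch. This is not the extended Gamma-Poisson model of the article: it assumes a much simpler setting with a single Poisson rate and a conjugate Gamma(shape, rate) prior, for which the total count and the number of records are sufficient statistics. The function names, the prior values, and the simulated subsets are all hypothetical and chosen only for illustration.

```python
import random


def summarize(counts):
    """Sufficient statistics for a Poisson rate: (total count, number of records)."""
    return sum(counts), len(counts)


def combine(summaries):
    """Add up per-subset summaries; no unit records are needed at this step."""
    total = sum(s for s, _ in summaries)
    n = sum(m for _, m in summaries)
    return total, n


def gamma_posterior(total, n, shape=1.0, rate=1.0):
    """Conjugate update: Gamma(shape, rate) prior -> Gamma(shape + total, rate + n).

    The prior values are assumptions made for this sketch only.
    """
    return shape + total, rate + n


random.seed(0)

# Hypothetical count data held by three separate custodians.
subsets = [[random.randint(0, 10) for _ in range(size)] for size in (50, 80, 30)]

# Each custodian releases only its summary statistics.
summaries = [summarize(s) for s in subsets]
total, n = combine(summaries)
post_from_summaries = gamma_posterior(total, n)

# Reference fit that pools the unit records, for comparison only.
pooled = [y for s in subsets for y in s]
post_from_units = gamma_posterior(*summarize(pooled))

print(post_from_summaries == post_from_units)
```

In this simplified setting the posterior obtained from the combined summaries coincides with the one obtained from the pooled unit records, which is the property that a summary statistics D&R strategy relies on.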


Pages: 202-208

Number of pages: 7

