Sufficiency Revisited: Rethinking Statistical Algorithms in the Big Data Era

Cited by: 15
Authors
Lee, Jarod Y. L. [1, 2]
Brown, James J. [1, 2]
Ryan, Louise M. [1, 2, 3]
Affiliations
[1] Univ Technol Sydney, Sch Math & Phys Sci, Ultimo, NSW 2007, Australia
[2] Univ Melbourne, Australian Res Council, Ctr Excellence Math & Stat Frontiers, Parkville, Vic 3010, Australia
[3] Harvard TH Chan Sch Publ Hlth, Dept Biostat, Boston, MA USA
Keywords
Big data; Distributed database; Divide and recombine; Generalized linear mixed model; Multilevel model; Privacy; MONTE-CARLO; COMPRESSION
DOI
10.1080/00031305.2016.1255659
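Chinese Library Classification (CLC) number
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics]
Discipline classification codes
020208; 070103; 0714
Abstract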
The big data era demands new statistical analysis paradigms, since traditional methods often break down when datasets are too large to fit on a single desktop computer. Divide and Recombine (D&R) is becoming a popular approach for big data analysis, where results are combined over subanalyses performed in separate data subsets. In this article, we consider situations where unit record data cannot be made available by data custodians due to privacy concerns, and explore the concept of statistical sufficiency and summary statistics for model fitting. The resulting approach represents a type of D&R strategy, which we refer to as summary statistics D&R, as opposed to the standard approach, which we refer to as horizontal D&R. We demonstrate the concept via an extended Gamma-Poisson model, where summary statistics are extracted from different databases and incorporated directly into the fitting algorithm without having to combine unit record data. By exploiting the natural hierarchy of data, our approach has major benefits in terms of privacy protection. Incorporating the proposed modelling framework into data extraction tools such as TableBuilder by the Australian Bureau of Statistics allows for potential analysis at a finer geographical level, which we illustrate with a multilevel analysis of the Australian unemployment data. Supplementary materials for this article are available online.
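The idea of fitting from summary statistics rather than unit records can be illustrated with a minimal sketch. It assumes the plain conjugate Gamma-Poisson model (counts y_j ~ Poisson(lambda * e_j) with exposure e_j, and lambda ~ Gamma(alpha, beta) in the shape/rate form), not the extended model of the article. Under that assumption the pair (sum of counts, sum of exposures) is sufficient for lambda, so each custodian only needs to release those two totals. The function names, prior values, and toy records below are hypothetical.

```python
from typing import Iterable, List, Tuple

Record = Tuple[int, float]      # (event count, exposure) for one unit record
Summary = Tuple[int, float]     # (total count, total exposure) for one database


def summarize(records: Iterable[Record]) -> Summary:
    """Reduce unit records to the sufficient statistics of a Poisson rate."""
    total_count = 0
    total_exposure = 0.0
    for count, exposure in records:
        total_count += count
        total_exposure += exposure
    return total_count, total_exposure


def gamma_posterior(alpha: float, beta: float,
                    summaries: List[Summary]) -> Tuple[float, float]:
    """Conjugate update: Gamma(alpha, beta) prior -> Gamma(alpha + sum y, beta + sum e)."""
    sum_counts = sum(s[0] for s in summaries)
    sum_exposure = sum(s[1] for s in summaries)
    return alpha + sum_counts, beta + sum_exposure


# Two hypothetical custodians holding unit records that cannot be pooled.
db_a = [(3, 1.0), (5, 2.0)]
db_b = [(2, 0.5), (4, 1.5), (1, 1.0)]

# Each custodian releases only its summary; the analyst combines them.
from_summaries = gamma_posterior(2.0, 1.0, [summarize(db_a), summarize(db_b)])

# Reference result that would require access to all unit records.
from_pooled = gamma_posterior(2.0, 1.0, [summarize(db_a + db_b)])

posterior_mean = from_summaries[0] / from_summaries[1]
```

In this toy setting the posterior obtained from the two summaries coincides with the one obtained from the pooled unit records, which is the property that a summary-statistics D&R strategy relies on.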
Pages: 202-208
Number of pages: 7