MCMC for Imbalanced Categorical Data

被引:42
作者
Johndrow, James E. [1 ]
Smith, Aaron [2 ]
Pillai, Natesh [3 ]
Dunson, David B. [4 ]
机构
[1] Stanford Univ, Dept Stat, Stanford, CA 94305 USA
[2] Univ Ottawa, Dept Math & Stat, Ottawa, ON, Canada
[3] Harvard Univ, Dept Stat, Cambridge, MA 02138 USA
[4] Duke Univ, Dept Stat Sci, Durham, NC USA
基金
美国国家科学基金会; 加拿大自然科学与工程研究理事会;
关键词
Bayesian; Data augmentation; Gibbs sampling; Large sample; Markov chain Monte Carlo; Mixing; MONTE-CARLO ALGORITHMS; BAYESIAN-INFERENCE; CONVERGENCE-RATES; COMPUTATIONAL-COMPLEXITY; MODELS; DISTRIBUTIONS; BINARY; BOUNDS;
D O I
10.1080/01621459.2018.1505626
中图分类号
O21 [概率论与数理统计]; C8 [统计学];
学科分类号
020208 ; 070103 ; 0714 ;
摘要
Many modern applications collect highly imbalanced categorical data, with some categories relatively rare. Bayesian hierarchical models combat data sparsity by borrowing information, while also quantifying uncertainty. However, posterior computation presents a fundamental barrier to routine use; a single class of algorithms does not work well in all settings and practitioners waste time trying different types of Markov chain Monte Carlo (MCMC) approaches. This article was motivated by an application to quantitative advertising in which we encountered extremely poor computational performance for data augmentation MCMC algorithms but obtained excellent performance for adaptive Metropolis. To obtain a deeper understanding of this behavior, we derive theoretical results on the computational complexity of commonly used data augmentation algorithms and the Random Walk Metropolis algorithm for highly imbalanced binary data. In this regime, our results show computational complexity of Metropolis is logarithmic in sample size, while data augmentation is polynomial in sample size. The root cause of this poor performance of data augmentation is a discrepancy between the rates at which the target density and MCMC step sizes concentrate. Our methods also show that MCMC algorithms that exhibit a similar discrepancy will fail in large samples-a result with substantial practical impact.
引用
收藏
页码:1394 / 1403
页数:10
相关论文
共 36 条
  • [1] BAYESIAN-ANALYSIS OF BINARY AND POLYCHOTOMOUS RESPONSE DATA
    ALBERT, JH
    CHIB, S
    [J]. JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 1993, 88 (422) : 669 - 679
  • [2] [Anonymous], 2015, ARXIV150803387
  • [3] [Anonymous], 2006, Texts in Statistical Science
  • [4] [Anonymous], ARXIV170303123
  • [5] ON THE COMPUTATIONAL COMPLEXITY OF MCMC-BASED ESTIMATORS IN LARGE SAMPLES
    Belloni, Alexandre
    Chernozhukov, Victor
    [J]. ANNALS OF STATISTICS, 2009, 37 (04) : 2011 - 2055
  • [6] Stan: A Probabilistic Programming Language
    Carpenter, Bob
    Gelman, Andrew
    Hoffman, Matthew D.
    Lee, Daniel
    Goodrich, Ben
    Betancourt, Michael
    Brubaker, Marcus A.
    Guo, Jiqiang
    Li, Peter
    Riddell, Allen
    [J]. JOURNAL OF STATISTICAL SOFTWARE, 2017, 76 (01): : 1 - 29
  • [7] The Polya-Gamma Gibbs sampler for Bayesian logistic regression is uniformly ergodic
    Choi, Hee Min
    Hobert, James P.
    [J]. ELECTRONIC JOURNAL OF STATISTICS, 2013, 7 : 2054 - 2064
  • [8] Damien P, 1999, J ROY STAT SOC B, V61, P331
  • [9] Gibbs sampling, exponential families and orthogonal polynomials
    Diaconis, Persi
    Khare, Kshitij
    Saloff-Coste, Laurent
    [J]. STATISTICAL SCIENCE, 2008, 23 (02) : 151 - 178
  • [10] Frühwirth-Schnatter S, 2010, STATISTICAL MODELLING AND REGRESSION STRUCTURES, P111, DOI 10.1007/978-3-7908-2413-1_7