MCMC for Imbalanced Categorical Data

被引：42

作者：

Johndrow, James E. ^{[1
]}

Smith, Aaron ^{[2
]}

Pillai, Natesh ^{[3
]}

Dunson, David B. ^{[4
]}

机构：

[1] Stanford Univ, Dept Stat, Stanford, CA 94305 USA

[2] Univ Ottawa, Dept Math & Stat, Ottawa, ON, Canada

[3] Harvard Univ, Dept Stat, Cambridge, MA 02138 USA

[4] Duke Univ, Dept Stat Sci, Durham, NC USA

来源：

JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION | 2019年 / 114卷 / 527期

基金：

美国国家科学基金会; 加拿大自然科学与工程研究理事会;

关键词：

Bayesian; Data augmentation; Gibbs sampling; Large sample; Markov chain Monte Carlo; Mixing; MONTE-CARLO ALGORITHMS; BAYESIAN-INFERENCE; CONVERGENCE-RATES; COMPUTATIONAL-COMPLEXITY; MODELS; DISTRIBUTIONS; BINARY; BOUNDS;

D O I：

10.1080/01621459.2018.1505626

中图分类号：

O21 [概率论与数理统计]; C8 [统计学];

学科分类号：

020208 ; 070103 ; 0714 ;

摘要：

Many modern applications collect highly imbalanced categorical data, with some categories relatively rare. Bayesian hierarchical models combat data sparsity by borrowing information, while also quantifying uncertainty. However, posterior computation presents a fundamental barrier to routine use; a single class of algorithms does not work well in all settings and practitioners waste time trying different types of Markov chain Monte Carlo (MCMC) approaches. This article was motivated by an application to quantitative advertising in which we encountered extremely poor computational performance for data augmentation MCMC algorithms but obtained excellent performance for adaptive Metropolis. To obtain a deeper understanding of this behavior, we derive theoretical results on the computational complexity of commonly used data augmentation algorithms and the Random Walk Metropolis algorithm for highly imbalanced binary data. In this regime, our results show computational complexity of Metropolis is logarithmic in sample size, while data augmentation is polynomial in sample size. The root cause of this poor performance of data augmentation is a discrepancy between the rates at which the target density and MCMC step sizes concentrate. Our methods also show that MCMC algorithms that exhibit a similar discrepancy will fail in large samples-a result with substantial practical impact.

引用

页码：1394 / 1403

页数：10

共 36 条

[1] BAYESIAN-ANALYSIS OF BINARY AND POLYCHOTOMOUS RESPONSE DATA
ALBERT, JH
CHIB, S
[J]. JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 1993, 88 (422) : 669 - 679
[2] [Anonymous], 2015, ARXIV150803387
[3] [Anonymous], 2006, Texts in Statistical Science
[4] [Anonymous], ARXIV170303123
[5] ON THE COMPUTATIONAL COMPLEXITY OF MCMC-BASED ESTIMATORS IN LARGE SAMPLES
Belloni, Alexandre
Chernozhukov, Victor
[J]. ANNALS OF STATISTICS, 2009, 37 (04) : 2011 - 2055
[6] Stan: A Probabilistic Programming Language
Carpenter, Bob
Gelman, Andrew
Hoffman, Matthew D.
Lee, Daniel
Goodrich, Ben
Betancourt, Michael
Brubaker, Marcus A.
Guo, Jiqiang
Li, Peter
Riddell, Allen
[J]. JOURNAL OF STATISTICAL SOFTWARE, 2017, 76 (01): : 1 - 29
[7] The Polya-Gamma Gibbs sampler for Bayesian logistic regression is uniformly ergodic
Choi, Hee Min
Hobert, James P.
[J]. ELECTRONIC JOURNAL OF STATISTICS, 2013, 7 : 2054 - 2064
[8] Damien P, 1999, J ROY STAT SOC B, V61, P331
[9] Gibbs sampling, exponential families and orthogonal polynomials
Diaconis, Persi
Khare, Kshitij
Saloff-Coste, Laurent
[J]. STATISTICAL SCIENCE, 2008, 23 (02) : 151 - 178
[10] Frühwirth-Schnatter S, 2010, STATISTICAL MODELLING AND REGRESSION STRUCTURES, P111, DOI 10.1007/978-3-7908-2413-1_7

← 1 2 3 4 →