Consistent estimation of the number of communities in stochastic block models using cross-validation

被引:3
作者
Qin, Jining [1 ]
Lei, Jing [2 ]
机构
[1] Two Sigma Investments LP, New York, NY 10013 USA
[2] Carnegie Mellon Univ, Dept Stat & Data Sci, Baker Hall 132, Pittsburgh, PA 15213 USA
基金
美国国家科学基金会;
关键词
cross-validation; model selection; network data; stochastic block models; BAYESIAN-INFERENCE; SELECTION;
D O I
10.1002/sta4.426
中图分类号
O21 [概率论与数理统计]; C8 [统计学];
学科分类号
020208 ; 070103 ; 0714 ;
摘要
The stochastic block model (SBM) and its variants constitute an important family of probabilistic tools for studying network data. There is a rich literature on methods for estimating block labels and model parameters of stochastic block models. Most of these studies require the number of communities K as an input, making the estimation of K an important problem. Cross-validation is a natural option for this problem since it is a widely used generic method for evaluating model fitting. However, cross-validation is known to be inconsistent and prone to overfitting unless impractical split ratios are used. Cross-validation with confidence (CVC) is proposed with better theoretical guarantees in conventional settings. We study the properties of CVC for stochastic block models. Our theoretical studies show that CVC, unlike the standard cross-validation, can consistently pick the optimal K under suitable conditions. We implement this method and check its performance against other established methods on both synthetic and real datasets.
引用
收藏
页数:10
相关论文
共 29 条
[1]  
[Anonymous], ARXIV150700827
[2]   Hypothesis testing for automated community detection in networks [J].
Bickel, Peter J. ;
Sarkar, Purnamrita .
JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY, 2016, 78 (01) :253-273
[3]   Network Cross-Validation for Determining the Number of Communities in Network Data [J].
Chen, Kehui ;
Lei, Jing .
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 2018, 113 (521) :241-251
[4]   GAUSSIAN APPROXIMATIONS AND MULTIPLIER BOOTSTRAP FOR MAXIMA OF SUMS OF HIGH-DIMENSIONAL RANDOM VECTORS [J].
Chernozhukov, Victor ;
Chetverikov, Denis ;
Kato, Kengo .
ANNALS OF STATISTICS, 2013, 41 (06) :2786-2819
[5]   Model selection and clustering in stochastic block models based on the exact integrated complete data likelihood [J].
Come, Etienne ;
Latouche, Pierre .
STATISTICAL MODELLING, 2015, 15 (06) :564-589
[6]   A mixture model for random graphs [J].
Daudin, J. -J. ;
Picard, F. ;
Robin, S. .
STATISTICS AND COMPUTING, 2008, 18 (02) :173-183
[7]  
Fienberg S.E., 1981, SOCIOL METHODOL, V12, P156, DOI DOI 10.2307/270741
[8]   Community structure in social and biological networks [J].
Girvan, M ;
Newman, MEJ .
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2002, 99 (12) :7821-7826
[9]   Model-based clustering for social networks [J].
Handcock, Mark S. ;
Raftery, Adrian E. ;
Tantrum, Jeremy M. .
JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES A-STATISTICS IN SOCIETY, 2007, 170 :301-322
[10]  
Hoff, 2007, ADV NEURAL INFORM PR, P657