Optimal Estimation of the Number of Network Communities

被引:24
作者
Jin, Jiashun [1 ]
Ke, Zheng Tracy [2 ]
Luo, Shengming [1 ]
Wang, Minzhe [3 ]
机构
[1] Carnegie Mellon Univ, Dept Stat, Baker Hall, Pittsburgh, PA 15213 USA
[2] Harvard Univ, Dept Stat, Cambridge, MA 02138 USA
[3] Jump Trading Grp, Chicago, IL USA
关键词
Community detection; k-means; Lower bound; Nonsplitting property (NSP); Over-fitting; Under-fitting; MODEL;
D O I
10.1080/01621459.2022.2035736
中图分类号
O21 [概率论与数理统计]; C8 [统计学];
学科分类号
020208 ; 070103 ; 0714 ;
摘要
In network analysis, how to estimate the number of communities K is a fundamental problem. We consider a broad setting where we allow severe degree heterogeneity and a wide range of sparsity levels, and propose Stepwise Goodness of Fit (StGoF) as a new approach. This is a stepwise algorithm, where for m = 1, 2, ..., we alternately use a community detection step and a goodness of fit (GoF) step. We adapt SCORE Jin for community detection, and propose a new GoF metric. We show that at step m, the GoF metric diverges to infinity in probability for all m < K and converges to N(0,1) if m = K. This gives rise to a consistent estimate for K. Also, we discover the right way to define the signal-to-noise ratio (SNR) for our problem and show that consistent estimates for K do not exist if SNR -> 0, and StGoF is uniformly consistent for K if SNR -> infinity. Therefore, StGoF achieves the optimal phase transition. Similar stepwise methods are known to face analytical challenges. We overcome the challenges by using a different stepwise scheme in StGoF and by deriving sharp results that are not available before. The key to our analysis is to show that SCORE has the Nonsplitting Property (NSP). Primarily due to a nontractable rotation of eigenvectors dictated by the Davis-Kahan sin(theta) theorem, the NSP is nontrivial to prove and requires new techniques we develop. Supplementary materials for this article are available online.
引用
收藏
页码:2101 / 2116
页数:16
相关论文
共 43 条
[1]  
[Anonymous], ARXIV150700827
[2]  
Bollobas B., 1998, Modern Graph Theory, V184
[3]  
Cammarata L., 2021, POWER ENHANCEM UNPUB
[4]  
Chen E.Y., 2020, ARXIV200705521
[5]   Network Cross-Validation for Determining the Number of Communities in Network Data [J].
Chen, Kehui ;
Lei, Jing .
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 2018, 113 (521) :241-251
[6]   A mixture model for random graphs [J].
Daudin, J. -J. ;
Picard, F. ;
Robin, S. .
STATISTICS AND COMPUTING, 2008, 18 (02) :173-183
[7]   ROTATION OF EIGENVECTORS BY A PERTURBATION .3. [J].
DAVIS, C ;
KAHAN, WM .
SIAM JOURNAL ON NUMERICAL ANALYSIS, 1970, 7 (01) :1-&
[8]   Higher criticism for detecting sparse heterogeneous mixtures [J].
Donoho, D ;
Jin, JS .
ANNALS OF STATISTICS, 2004, 32 (03) :962-994
[9]   Large-scale simultaneous hypothesis testing: The choice of a null hypothesis [J].
Efron, B .
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 2004, 99 (465) :96-104
[10]  
Fan J., 2019, J R STAT SOC