ROBUST AND COMPUTATIONALLY FEASIBLE COMMUNITY DETECTION IN THE PRESENCE OF ARBITRARY OUTLIER NODES

被引:94
作者
Cai, T. Tony [1 ]
Li, Xiaodong [1 ]
机构
[1] Univ Penn, Wharton Sch, Dept Stat, Philadelphia, PA 19104 USA
基金
美国国家科学基金会;
关键词
Robust community detection; SDP relaxation; dual certificate; k-means clustering; STOCHASTIC BLOCKMODELS; MAXIMUM-LIKELIHOOD; SIGNAL RECOVERY; CONSISTENCY; MODELS; PREDICTION; PARTITIONS; GRAPHS;
D O I
10.1214/14-AOS1290
中图分类号
O21 [概率论与数理统计]; C8 [统计学];
学科分类号
020208 ; 070103 ; 0714 ;
摘要
Community detection, which aims to cluster N nodes in a given graph into r distinct groups based on the observed undirected edges, is an important problem in network data analysis. In this paper, the popular stochastic block model (SBM) is extended to the generalized stochastic block model (GSBM) that allows for adversarial outlier nodes, which are connected with the other nodes in the graph in an arbitrary way. Under this model, we introduce a procedure using convex optimization followed by k-means algorithm with k = r. Both theoretical and numerical properties of the method are analyzed. A theoretical guarantee is given for the procedure to accurately detect the communities with small misclassification rate under the setting where the number of clusters can grow with N. This theoretical result admits to the best-known result in the literature of computationally feasible community detection in SBM without outliers. Numerical results show that our method is both computationally fast and robust to different kinds of outliers, while some popular computationally fast community detection algorithms, such as spectral clustering applied to adjacency matrices or graph Laplacians, may fail to retrieve the major clusters due to a small portion of outliers. We apply a slight modification of our method to a political blogs data set, showing that our method is competent in practice and comparable to existing computationally feasible methods in the literature. To the best of the authors' knowledge, our result is the first in the literature in terms of clustering communities with fast growing numbers under the GSBM where a portion of arbitrary outlier nodes exist.
引用
收藏
页码:1027 / 1059
页数:33
相关论文
共 55 条
[1]  
Adamic L.A., 2005, P 3 INT WORKSH LINK, P36
[2]   Strong converse for identification via quantum channels [J].
Ahlswede, R ;
Winter, A .
IEEE TRANSACTIONS ON INFORMATION THEORY, 2002, 48 (03) :569-579
[3]  
Airoldi EM, 2008, J MACH LEARN RES, V9, P1981
[4]   Guaranteed clustering and biclustering via semidefinite programming [J].
Ames, Brendan P. W. .
MATHEMATICAL PROGRAMMING, 2014, 147 (1-2) :429-465
[5]   Convex optimization for the planted k-disjoint-clique problem [J].
Ames, Brendan P. W. ;
Vavasis, Stephen A. .
MATHEMATICAL PROGRAMMING, 2014, 143 (1-2) :299-337
[6]   PSEUDO-LIKELIHOOD METHODS FOR COMMUNITY DETECTION IN LARGE SPARSE NETWORKS [J].
Amini, Arash A. ;
Chen, Aiyou ;
Bickel, Peter J. ;
Levina, Elizaveta .
ANNALS OF STATISTICS, 2013, 41 (04) :2097-2122
[7]  
[Anonymous], J MACH LEARN RES
[8]  
[Anonymous], ROBUST COMPUTATION S
[9]  
[Anonymous], ROLE NORMALIZATION S
[10]  
[Anonymous], FDN COMPUT IN PRESS