Summable and nonsummable data-driven models for community detection in feature-rich networks

被引：8

作者：

Shalileh, Soroosh ^{[1
,2
]}

Mirkin, Boris ^{[1
,3
]}

机构：

[1] HSE Univ, Lab Methods Big Data Anal, Pokrovsky Blvd 11, Moscow, Russia

[2] HSE Univ, Lab Methods Big Data Anal, Pokrovsky Blvd 11, Moscow, Russia

[3] Birkbeck Univ London, Dept Comp Sci & Informat Syst, London WC1E 7HX, England

来源：

SOCIAL NETWORK ANALYSIS AND MINING | 2021年 / 11卷 / 01期

关键词：

Attributed network; Feature-rich network; Community detection; Sequential extraction; Least squares data recovery; One-by-one clustering; K-MEANS; ALGORITHM;

D O I：

10.1007/s13278-021-00774-8

中图分类号：

TP [自动化技术、计算机技术];

学科分类号：

0812 ;

摘要：

A feature-rich network is a network whose nodes are characterized by categorical or quantitative features. We propose a data-driven model for finding a partition of the nodes to approximate both the network link data and the feature data. The model involves summary quantitative characteristics of both network links and features. We distinguish between two modes of using the network link data. One mode postulates that the link values are comparable and summable across the network (summability); the other assumption models the case in which different nodes represent different measurement systems so that the link data are neither comparable, nor summable, across different nodes (nonsummability). We derive a Pythagorean decomposition of the combined data scatter involving our data recovery least-squares criterion. We address an equivalent problem of maximizing its complementary part, the contribution of a found partition to the combined data scatter. We follow a doubly greedy strategy in maximizing that. First, communities are found one-by-one, and second, entities are added one-by-one in the process of identifying a community. Our algorithms determine the number of clusters automatically. The nonsummability version proves to have a niche of its own; also, it is faster than the other version. In our experiments, they appear to be competitive over generated synthetic data sets and six real-world data sets from the literature.

引用

页数：23

共 59 条

[1] Akoglu Leman., 2012, SDM
[2] [Anonymous], 2004, HIDDEN POWER SOCIAL
[3] [Anonymous], 2003, P TEXT MINING LINK A
[4] [Anonymous], 2012, ELEMENTS INFORM THEO
[5] [Anonymous], 2012, INT C DIGITAL SOC
[6] Covariate-assisted spectral clustering
Binkiewicz, N.
Vogelstein, J. T.
Rohe, K.
[J]. BIOMETRIKA, 2017, 104 (02) : 361 - 377
[7] Fast unfolding of communities in large networks
Blondel, Vincent D.
Guillaume, Jean-Loup
Lambiotte, Renaud
Lefebvre, Etienne
[J]. JOURNAL OF STATISTICAL MECHANICS-THEORY AND EXPERIMENT, 2008,
[8] Bojchevski A, 2018, AAAI CONF ARTIF INTE, P2738
[9] Boyd S., 2004, CONVEX OPTIMIZATION, DOI [DOI 10.1017/CBO9780511804441, 10.1017/CBO9780511804441]
[10] Combination of links and node contents for community discovery using a graph regularization approach
Cao, Jinxin
Wang, Hongcui
Jin, Di
Dang, Jianwu
[J]. FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2019, 91 : 361 - 370

← 1 2 3 4 5 6 →