Multi-Label Regularized Generative Model for Semi-Supervised Collective Classification in Large-Scale Networks

Cited by: 7

Authors:

Wu, Qingyao [1]

Chen, Jian [1]

Ho, Shen-Shyang [2]

Li, Xutao [2]

Min, Huaqing [1]

Han, Chao [1]

Affiliations:

[1] S China Univ Technol, Sch Software Engn, Guangzhou, Guangdong, Peoples R China

[2] Nanyang Technol Univ, Sch Comp Engn, Singapore 639798, Singapore

Source:

BIG DATA RESEARCH | 2015 / Volume 2 / Issue 04

Keywords:

Collective classification; Generative model; Semi-supervised learning; Multi-label learning; Large-scale sparsely labeled networks;

DOI:

10.1016/j.bdr.2015.04.002

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The problem of collective classification (CC) for large-scale network data has received considerable attention in the last decade. Enabling CC usually increases accuracy when given a fully labeled network with a large amount of labeled data. However, such labels can be difficult to obtain, and learning a CC model with only a few such labels in large-scale sparsely labeled networks can lead to poor performance. In this paper, we show that leveraging the unlabeled portion of the data through semi-supervised collective classification (SSCC) is essential to achieving high performance. First, we describe a novel data-generating algorithm, called generative model with network regularization (GMNR), to exploit both labeled and unlabeled data in large-scale sparsely labeled networks. In GMNR, a network regularizer is constructed to encode the network structure information, and we apply the network regularizer to smooth the probability density functions of the generative model. Second, we extend our proposed GMNR algorithm to handle network data consisting of multi-label instances. This approach, called the multi-label regularized generative model (MRGM), includes an additional label regularizer to encode the label correlation, and we show how these smoothing regularizers can be incorporated into the objective function of the model to improve the performance of CC in the multi-label setting. We then develop an optimization scheme based on the EM algorithm to solve the objective function. Empirical results on several real-world network data classification tasks show that our proposed methods outperform the compared collective classification algorithms, especially when labeled data is scarce. (C) 2015 Elsevier Inc. All rights reserved.
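
The general idea in the abstract (a generative model fitted by EM on labeled and unlabeled nodes, with class posteriors smoothed along network edges) can be illustrated with a small sketch. This is not the paper's GMNR or MRGM: the record gives no objective function or update rules, so the multinomial class model, the neighbor-averaging step that stands in for a network regularizer, and every name here (`ssl_em_with_graph_smoothing`, `lam`, `n_iter`) are assumptions chosen for illustration only.

```python
import numpy as np

def ssl_em_with_graph_smoothing(X, A, y, n_classes, lam=0.5, n_iter=20):
    """Illustrative semi-supervised EM with neighbor smoothing (hypothetical helper).

    X: (n, d) non-negative count features.
    A: (n, n) symmetric adjacency matrix.
    y: (n,) integer labels, -1 marks an unlabeled node.
    Returns an (n, n_classes) array of class posteriors.
    """
    X = np.asarray(X, dtype=float)
    A = np.asarray(A, dtype=float)
    y = np.asarray(y)
    n = X.shape[0]
    labeled = y >= 0
    one_hot = np.eye(n_classes)

    # Row-normalised adjacency; isolated nodes keep an all-zero row.
    deg = A.sum(axis=1, keepdims=True)
    W = np.divide(A, deg, out=np.zeros_like(A), where=deg > 0)
    has_nb = deg[:, 0] > 0

    # Start from one-hot posteriors on labeled nodes, uniform elsewhere.
    Q = np.full((n, n_classes), 1.0 / n_classes)
    Q[labeled] = one_hot[y[labeled]]

    for _ in range(n_iter):
        # M-step: Laplace-smoothed class priors and multinomial parameters.
        prior = Q.sum(axis=0) + 1.0
        prior /= prior.sum()
        theta = Q.T @ X + 1.0
        theta /= theta.sum(axis=1, keepdims=True)

        # E-step: posteriors under the current generative model.
        log_post = np.log(prior) + X @ np.log(theta).T
        log_post -= log_post.max(axis=1, keepdims=True)
        P = np.exp(log_post)
        P /= P.sum(axis=1, keepdims=True)

        # Assumed smoothing step: mix each posterior with the average
        # posterior of its neighbors from the previous iteration.
        S = P.copy()
        S[has_nb] = (1.0 - lam) * P[has_nb] + lam * (W @ Q)[has_nb]

        # Known labels stay clamped.
        S[labeled] = one_hot[y[labeled]]
        Q = S
    return Q
```

Setting `lam=0` reduces the sketch to plain semi-supervised EM with no use of the network, which makes the effect of the smoothing step easy to compare on a toy graph.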

Citation

Pages: 187-201

Page count: 15

