A term correlation based semi-supervised microblog clustering with dual constraints

Cited by: 3
Authors
Ma, Huifang [1, 2]
Zhang, Di [1]
Jia, Meihuizi [1]
Lin, Xianghong [1]
Affiliations
[1] Northwest Normal Univ, Coll Comp Sci & Engn, Lanzhou 730070, Gansu, Peoples R China
[2] Chinese Acad Sci, Inst Comp Technol, Key Lab Intelligent Informat Proc, Beijing 100085, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Semi-supervised clustering; Microblogs; Dual constraints; Term correlation matrix; Nonnegative matrix factorization
DOI
10.1007/s13042-017-0750-0
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Microblog clustering is important in many web applications. However, microblogs are short and provide too few word occurrences, which prevents traditional text clustering approaches from reaching their full potential. To address this problem, we propose a novel semi-supervised learning scheme that exploits semantic information to compensate for the limited message length. The key idea is to use term correlation data, which captures semantic information for term weighting and provides greater context for microblogs. We then formulate the microblog clustering problem within a semi-supervised non-negative matrix factorization co-clustering framework, which takes advantage of prior domain knowledge of both data points (microblogs), in the form of pairwise constraints, and features (terms), in the form of category knowledge. Our approach not only greatly reduces the labor-intensive labeling process but also exploits hidden information in the microblogs themselves. Extensive experiments on two real-world microblog datasets demonstrate the effectiveness of the proposed approach, which achieves promising performance compared with state-of-the-art methods.
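As a rough illustration of how term correlation can give a short message more context, the sketch below enriches each message vector with terms correlated to the ones it contains. The abstract does not specify the correlation measure or the weighting scheme, so cosine similarity over document co-occurrence is an assumption here, and the function names `term_correlation` and `enrich` are hypothetical. It does not cover the constrained matrix factorization step.

```python
from math import sqrt

def term_correlation(docs, vocab):
    """Cosine similarity between term occurrence vectors (an assumed measure)."""
    # One binary vector per term: 1.0 for each document that contains it.
    cols = {t: [1.0 if t in d else 0.0 for d in docs] for t in vocab}
    corr = {}
    for a in vocab:
        for b in vocab:
            dot = sum(x * y for x, y in zip(cols[a], cols[b]))
            norm = sqrt(sum(x * x for x in cols[a])) * sqrt(sum(y * y for y in cols[b]))
            corr[(a, b)] = dot / norm if norm else 0.0
    return corr

def enrich(doc, vocab, corr):
    """Weight every vocabulary term by its strongest correlation with a term in the message."""
    return [max((corr[(t, w)] for w in doc), default=0.0) for t in vocab]

# Toy messages represented as sets of terms (illustrative data only).
docs = [{"nmf", "matrix"}, {"matrix", "factorization"}, {"football", "goal"}]
vocab = sorted(set().union(*docs))
corr = term_correlation(docs, vocab)
enriched = [enrich(d, vocab, corr) for d in docs]
```

In this toy data the first message never mentions "factorization", yet it receives a nonzero weight for that term because "matrix" co-occurs with it elsewhere, while unrelated terms such as "football" stay at zero.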
Pages: 679-692
Page count: 14