Clustering tagged documents with labeled and unlabeled documents

Cited by: 11
Authors
Liu, Chien-Liang [1]
Hsaio, Wen-Hoar [1]
Lee, Chia-Hoang [1]
Chen, Chun-Hsien [1]
Affiliations
[1] Natl Tsing Hua Univ, Dept Comp Sci, Hsinchu 300, Taiwan
Keywords
Text mining; Document clustering; Semi-supervised clustering; Tagged document clustering
DOI
10.1016/j.ipm.2012.12.004
Chinese Library Classification number
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
This study applies our proposed semi-supervised clustering method, Constrained-PLSA, to cluster tagged documents using a small number of labeled documents, and evaluates system performance on two data sets. In the first data set the boundaries among clusters are unclear, while in the second they are clear. Documents are clustered using paper abstracts and user-annotated tags, and four combinations of tags and words are used as feature representations. The experimental results indicate that almost all of the methods benefit from tags. However, the unsupervised learning methods fail to function properly on the data set with noisy information, whereas Constrained-PLSA functions properly. In many real applications background knowledge is readily available, so it is appropriate to use it in the clustering process to make learning faster and more effective. (C) 2012 Elsevier Ltd. All rights reserved.
Pages: 596-606
Page count: 11