Semi-supervised model-based document clustering: A comparative study

Cited by: 0
Author
Shi Zhong
Affiliation
[1] Florida Atlantic University, Department of Computer Science and Engineering
Source
Machine Learning | 2006, Vol. 65
Keywords
Semi-supervised clustering; Seeded clustering; Constrained clustering; Clustering with feedback; Model-based clustering; Deterministic annealing
DOI
Not available
Abstract
Semi-supervised learning has become an attractive methodology for improving classification models and is often viewed as using unlabeled data to aid supervised learning. However, it can also be viewed as using labeled data to help clustering, namely, semi-supervised clustering. Viewing semi-supervised learning from a clustering angle is useful in practical situations where the set of labels available in the labeled data is incomplete, i.e., the unlabeled data contain new classes that are not present in the labeled data. This paper analyzes several multinomial model-based semi-supervised document clustering methods under a principled model-based clustering framework. The framework naturally leads to a deterministic annealing extension of existing semi-supervised clustering approaches. We compare three (slightly) different semi-supervised approaches for clustering documents: Seeded damnl, Constrained damnl, and Feedback-based damnl, where damnl stands for the multinomial model-based deterministic annealing algorithm. The first two are extensions of the seeded k-means and constrained k-means algorithms studied by Basu et al. (2002); the last one is motivated by Cohn et al. (2003). Through empirical experiments on text datasets, we show that: (a) deterministic annealing can often significantly improve the performance of semi-supervised clustering; (b) the constrained approach is best when the available labels are complete, whereas the feedback-based approach excels when the available labels are incomplete.
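The abstract describes the damnl family only at a high level. The sketch below is one plausible reading of the seeded and constrained variants of multinomial model-based clustering with deterministic annealing: labeled documents either initialize the cluster parameters (seeded) or stay fixed to their classes throughout (constrained), and the E-step posteriors are tempered by a temperature that is gradually lowered. The function name, temperature schedule, and smoothing constant are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def damnl_cluster(X, n_clusters, seed_labels=None, constrained=False,
                  T0=2.0, T_min=1.0, cool=0.9, n_iter=20, smooth=0.01,
                  random_state=None):
    """Toy seeded/constrained multinomial clustering with deterministic
    annealing.  X: (n_docs, n_terms) term-count matrix; seed_labels:
    cluster index for labeled docs, -1 for unlabeled ones."""
    rng = np.random.default_rng(random_state)
    n_docs, _ = X.shape
    # Soft assignments (posteriors), initialized at random.
    post = rng.random((n_docs, n_clusters))
    post /= post.sum(axis=1, keepdims=True)
    labeled = None
    if seed_labels is not None:
        seed_labels = np.asarray(seed_labels)
        labeled = seed_labels >= 0
        # Seeded variant: labeled documents pin down their clusters initially.
        post[labeled] = np.eye(n_clusters)[seed_labels[labeled]]
    T = T0
    while T >= T_min:
        for _ in range(n_iter):
            # M-step: cluster priors and smoothed multinomial word probabilities.
            priors = (post.sum(axis=0) + 1e-12) / post.sum()
            counts = post.T @ X + smooth                   # (K, n_terms)
            theta = counts / counts.sum(axis=1, keepdims=True)
            # E-step at temperature T: tempered (softened) posteriors.
            loglik = X @ np.log(theta).T + np.log(priors)  # (n_docs, K)
            scaled = loglik / T
            scaled -= scaled.max(axis=1, keepdims=True)    # numerical stability
            post = np.exp(scaled)
            post /= post.sum(axis=1, keepdims=True)
            if constrained and labeled is not None:
                # Constrained variant: labeled documents never leave their classes.
                post[labeled] = np.eye(n_clusters)[seed_labels[labeled]]
        T *= cool                                          # lower the temperature
    return post.argmax(axis=1), theta

# Example: six toy documents over a four-word vocabulary, two clusters,
# with the first three documents carrying seed labels.
X = np.array([[5, 1, 0, 0], [4, 2, 0, 1], [0, 0, 6, 2],
              [1, 0, 5, 3], [6, 0, 1, 0], [0, 1, 4, 4]])
labels, theta = damnl_cluster(X, 2, seed_labels=[0, 0, 1, -1, -1, -1],
                              constrained=True, random_state=0)
```

The feedback-based variant mentioned in the abstract would instead run an unsupervised pass first and then fold user-supplied labels back into a second pass; it is omitted from this sketch.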
Pages: 3-29
Number of pages: 26
Related papers
27 items in total
[1] Banfield, J. D., & Raftery, A. E. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics, 49, 803-821.
[2] Castelli, V., & Cover, T. M. (1996). The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter. IEEE Transactions on Information Theory, 42, 2102-2117.
[3] Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1-38.
[4] Dhillon, I. S., & Modha, D. S. (2001). Concept decompositions for large sparse text data using clustering. Machine Learning, 42, 143-175.
[5] Jardine, N., & van Rijsbergen, C. J. (1971). The use of hierarchical clustering in information retrieval. Information Storage and Retrieval, 7, 217-240.
[6] Katsavounidis, I., Kuo, C.-C. J., & Zhang, Z. (1994). A new initialization technique for generalized Lloyd iteration. IEEE Signal Processing Letters, 1, 144-146.
[7] Mardia, K. V. (1975). Statistics of directional data. Journal of the Royal Statistical Society, Series B, 37, 349-393.
[8] Meila, M., & Heckerman, D. (2001). An experimental comparison of model-based clustering methods. Machine Learning, 42, 9-29.
[9] Nigam, K., McCallum, A., Thrun, S., & Mitchell, T. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39, 103-134.
[10] Strehl, A., & Ghosh, J. (2002). Cluster ensembles - a knowledge reuse framework for combining partitions. Journal of Machine Learning Research, 3, 583-617.