Self-Tuned Descriptive Document Clustering Using a Predictive Network

Cited by: 13
Authors
Brockmeier, Austin J. [1]
Mu, Tingting [2]
Ananiadou, Sophia [2]
Goulermas, John Y. [1]
Affiliations
[1] Univ Liverpool, Sch Elect Engn Elect & Comp Sci, Liverpool L69 3BX, Merseyside, England
[2] Univ Manchester, Sch Comp Sci, Manchester M1 7DN, Lancs, England
Funding
UK Medical Research Council;
Keywords
Descriptive clustering; feature selection; logistic regression; model selection; sparse models; FEATURE-SELECTION; INFORMATION; REGULARIZATION; ALGORITHM; MODELS; SPACE;
DOI
10.1109/TKDE.2017.2781721
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Descriptive clustering consists of automatically organizing data instances into clusters and generating a descriptive summary for each cluster. The description should inform a user about the contents of each cluster without further examination of the specific instances, enabling a user to rapidly scan for relevant clusters. Selection of descriptions often relies on heuristic criteria. We model descriptive clustering as an auto-encoder network that predicts features from cluster assignments and predicts cluster assignments from a subset of features. The subset of features used for predicting a cluster serves as its description. For text documents, the occurrence or count of words, phrases, or other attributes provides a sparse feature representation with interpretable feature labels. In the proposed network, cluster predictions are made using logistic regression models, and feature predictions rely on logistic or multinomial regression models. Optimizing these models leads to a completely self-tuned descriptive clustering approach that automatically selects the number of clusters and the number of features for each cluster. We applied the methodology to a variety of short text documents and showed that the selected clustering, as evidenced by the selected feature subsets, is associated with a meaningful topical organization.
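The idea that "the subset of features used for predicting a cluster serves as its description" can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes cluster labels are already given, uses a hand-picked L1 penalty instead of the automatic model selection the abstract describes, and fits one-vs-rest logistic regression by plain proximal gradient descent. The names `fit_sparse_logistic` and `describe_clusters`, the toy vocabulary, and all parameter values are hypothetical.

```python
import math

def fit_sparse_logistic(X, y, l1=0.05, step=0.5, iters=500):
    """L1-regularized logistic regression via proximal gradient descent.

    Illustrative solver only; the bias term b is left unpenalized.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(iters):
        # Predicted probabilities under the current weights.
        p = [1.0 / (1.0 + math.exp(-(b + sum(wj * xj for wj, xj in zip(w, row)))))
             for row in X]
        # Gradient of the mean logistic loss.
        grad_w = [sum((p[i] - y[i]) * X[i][j] for i in range(n)) / n for j in range(d)]
        grad_b = sum(p[i] - y[i] for i in range(n)) / n
        b -= step * grad_b
        # Gradient step, then soft-thresholding (proximal operator of the L1 norm).
        for j in range(d):
            z = w[j] - step * grad_w[j]
            w[j] = math.copysign(max(abs(z) - step * l1, 0.0), z)
    return w, b

def describe_clusters(X, labels, vocab, l1=0.05):
    """Describe each cluster by the features with positive weight in its predictor."""
    descriptions = {}
    for k in sorted(set(labels)):
        y = [1.0 if c == k else 0.0 for c in labels]
        w, _ = fit_sparse_logistic(X, y, l1=l1)
        descriptions[k] = [vocab[j] for j, wj in enumerate(w) if wj > 1e-6]
    return descriptions

# Toy binary document-term matrix (assumed data, rows are documents).
vocab = ["gene", "protein", "cell", "market", "stock", "price", "the"]
X = [
    [1, 1, 0, 0, 0, 0, 1],
    [1, 0, 1, 0, 0, 0, 1],
    [0, 1, 1, 0, 0, 0, 1],
    [1, 1, 1, 0, 0, 0, 1],
    [0, 0, 0, 1, 1, 0, 1],
    [0, 0, 0, 1, 0, 1, 1],
    [0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]  # assumed to come from a prior clustering step

descriptions = describe_clusters(X, labels, vocab)
print(descriptions)
```

In this toy setup the word "the" occurs in every document, so it carries no information about cluster membership and the L1 penalty is expected to keep its weight at zero. A larger `l1` would shorten each description. The abstract indicates that the paper chooses both the number of clusters and the description length automatically, which this sketch does not attempt.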
Pages: 1929-1942
Page count: 14