A document clustering algorithm for discovering and describing topics

Cited by: 26
Authors
Anaya-Sanchez, Henry [1]
Pons-Porrata, Aurora [2]
Berlanga-Llavori, Rafael [1]
Affiliations
[1] Univ Jaume 1, Dept Languages & Comp Syst, Castellon de La Plana, Spain
[2] Univ Oriente, Ctr Pattern Recognit & Data Min, Santiago De Cuba, Cuba
Keywords
Document clustering; Topic discovery; Topic description
DOI
10.1016/j.patrec.2009.11.013
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this paper, we introduce a new clustering algorithm for discovering and describing the topics comprised in a text collection. Our proposal relies on both the most probable term pairs generated from the collection and the estimation of the topic homogeneity associated with these pairs. Topics and their descriptions are generated from those term pairs whose support sets are homogeneous enough to represent collection topics. Experimental results obtained over three benchmark text collections demonstrate the effectiveness and utility of this new approach. © 2009 Published by Elsevier B.V.
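The pipeline outlined in the abstract (rank term pairs, collect each pair's support set, keep pairs whose support set is homogeneous enough) can be illustrated with a minimal sketch. The abstract does not say how pair probability or topic homogeneity is estimated, so everything here is an assumption: pair probability is approximated by document co-occurrence frequency, homogeneity by mean pairwise cosine similarity over raw term counts, and the function names and the `min_support` and `min_homogeneity` thresholds are hypothetical. It should not be read as the authors' algorithm.

```python
from collections import Counter
from itertools import combinations
import math


def term_pair_support(docs):
    """Map each unordered term pair to the set of document indices containing both terms."""
    support = {}
    for i, doc in enumerate(docs):
        terms = sorted(set(doc.lower().split()))
        for pair in combinations(terms, 2):
            support.setdefault(pair, set()).add(i)
    return support


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counter objects)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def homogeneity(doc_ids, vectors):
    """ASSUMED proxy for topic homogeneity: mean pairwise cosine similarity."""
    ids = sorted(doc_ids)
    if len(ids) < 2:
        return 0.0
    sims = [cosine(vectors[i], vectors[j]) for i, j in combinations(ids, 2)]
    return sum(sims) / len(sims)


def discover_topics(docs, min_support=2, min_homogeneity=0.5):
    """Return (term pair, support set) topics; thresholds are hypothetical."""
    vectors = [Counter(d.lower().split()) for d in docs]
    support = term_pair_support(docs)
    # ASSUMPTION: rank pairs by co-occurrence frequency (largest support first),
    # breaking ties alphabetically so the output is deterministic.
    ranked = sorted(support.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    topics, covered = [], set()
    for pair, ids in ranked:
        if len(ids) < min_support:
            break  # all remaining pairs have even smaller support
        if ids <= covered:
            continue  # every supporting document already belongs to a topic
        if homogeneity(ids, vectors) >= min_homogeneity:
            topics.append((pair, sorted(ids)))  # the pair doubles as the description
            covered |= ids
    return topics
```

In this sketch each accepted term pair serves both as the seed of a cluster (its support set) and as that cluster's description, which mirrors the abstract's claim that topics and their descriptions come from the same term pairs.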
Pages: 502-510
Page count: 9