Exploiting Wikipedia Knowledge for Conceptual Hierarchical Clustering of Documents

Cited by: 20
Authors
Spanakis, Gerasimos [1]
Siolas, Georgios [1]
Stafylopatis, Andreas [1]
Affiliations
[1] Natl Tech Univ Athens, Sch Elect & Comp Engn, Intelligent Syst Lab, Athens 15780, Greece
Keywords
document clustering; document representation; Wikipedia knowledge; conceptual clustering
DOI
10.1093/comjnl/bxr024
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
In this paper, we propose a novel method for conceptual hierarchical clustering of documents using knowledge extracted from Wikipedia. The proposed method overcomes the disadvantages of the classic bag-of-words model by exploiting Wikipedia's textual content and link structure. A robust and compact document representation is built in real time using the Wikipedia application programming interface (API), without the need to store any Wikipedia information locally. The clustering process is hierarchical and extends the idea of frequent items by using Wikipedia article titles to select cluster labels that are descriptive and important for the examined corpus. Experiments show that the proposed technique greatly improves on the baseline approach in terms of F-measure and entropy, as well as in computational cost.
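The abstract reports results in terms of F-measure and entropy but does not define them here. As a minimal sketch, assuming the definitions commonly used in the document clustering literature (size-weighted cluster entropy, and the class-weighted best-match F-measure), the two scores could be computed as follows. The function names and the toy labels are illustrative assumptions and are not taken from the paper.

```python
import math
from collections import Counter


def clustering_entropy(labels, clusters):
    """Size-weighted average entropy (in bits) of the true labels per cluster.

    Lower is better; 0 means every cluster holds a single class.
    Assumed definition, not necessarily the paper's exact formulation.
    """
    n = len(labels)
    members_by_cluster = {}
    for label, cluster in zip(labels, clusters):
        members_by_cluster.setdefault(cluster, []).append(label)

    total = 0.0
    for members in members_by_cluster.values():
        size = len(members)
        counts = Counter(members)
        cluster_entropy = -sum(
            (k / size) * math.log2(k / size) for k in counts.values()
        )
        total += (size / n) * cluster_entropy
    return total


def clustering_f_measure(labels, clusters):
    """Overall F-measure: for each class take the best F1 over all clusters,
    then average those values weighted by class size. Higher is better.
    """
    n = len(labels)
    class_sizes = Counter(labels)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(labels, clusters))

    total = 0.0
    for cls, n_i in class_sizes.items():
        best = 0.0
        for clu, n_j in cluster_sizes.items():
            n_ij = joint.get((cls, clu), 0)
            if n_ij == 0:
                continue
            precision = n_ij / n_j
            recall = n_ij / n_i
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_i / n) * best
    return total
```

Both functions take two parallel sequences: the true class of each document and the cluster it was assigned to. A clustering that matches the classes exactly yields an F-measure of 1 and an entropy of 0.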
Pages: 299-312
Page count: 14