A probabilistic relational approach for web document clustering

Cited by: 9
Authors
Fersini, E. [1]
Messina, E. [1]
Archetti, F. [1, 2]
Affiliations
[1] Univ Milano Bicocca, Dipartimento Informat Sistemist & Comunicaz, Milan, Italy
[2] Consorzio Milano Ric, I-20126 Milan, Italy
Keywords
Relational document clustering; Relational web structure estimation
DOI
10.1016/j.ipm.2009.08.003
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
The exponential growth of information available on the World Wide Web and retrievable by search engines has made it necessary to develop efficient and effective methods for organizing relevant content. Document clustering plays an important role here and remains an interesting and challenging problem in web computing. In this paper we present a document clustering method that takes into account both the content information and the hyperlink structure of a web page collection, where a document is viewed as a set of semantic units. We exploit this representation to determine the strength of the relation between two linked pages and to define a relational clustering algorithm based on a probabilistic graph representation. The experimental results show that the proposed approach, called RED-clustering, outperforms two of the best-known clustering algorithms, k-Means and Expectation Maximization. (C) 2009 Elsevier Ltd. All rights reserved.
Pages: 117-130
Number of pages: 14