Clustering Heterogeneous Web Data Using Clustering by Compression. Cluster Validity

Cited by: 1
Authors
Cernian, Alexandra [1]
Carstoiu, Dorin [1]
Olteanu, Adriana [1]
Affiliation
[1] Univ Politehn Bucuresti, Fac Automat Control & Comp Sci, Bucharest, Romania
Source
PROCEEDINGS OF THE 10TH INTERNATIONAL SYMPOSIUM ON SYMBOLIC AND NUMERIC ALGORITHMS FOR SCIENTIFIC COMPUTING | 2009
Keywords
clustering; heterogeneous data; cluster validity
DOI
10.1109/SYNASC.2008.64
Chinese Library Classification
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
The expansive growth of the Internet has produced a vast quantity of data that is unstructured compared with a conventional database. Applying clustering to the World Wide Web is essential for extracting structured information from this sea of data. In this paper, we test the results of a new clustering technique, clustering by compression, when applied to heterogeneous data sets. The clustering by compression procedure is based on a parameter-free, universal similarity distance, the normalized compression distance (NCD), computed from the lengths of compressed data files (singly and in pairwise concatenation). To validate the results, we calculate several cluster quality indices. If the values obtained indicate high-quality clustering, we plan to include the clustering by compression technique in a framework for clustering heterogeneous web objects.
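As an illustration of the distance the abstract describes, the sketch below computes NCD from compressed lengths using the standard definition NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed length. The record does not say which compressor the authors used; `zlib` is an assumption made here, and the helper names `compressed_len`, `ncd`, and `distance_matrix` are hypothetical.

```python
import zlib


def compressed_len(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (zlib is an assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    Uses the lengths of each input compressed singly and of their
    concatenation, as described in the abstract.
    """
    cx = compressed_len(x)
    cy = compressed_len(y)
    cxy = compressed_len(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def distance_matrix(items: list[bytes]) -> list[list[float]]:
    """Pairwise NCD matrix; the diagonal is set to 0.0 by convention."""
    n = len(items)
    return [
        [0.0 if i == j else ncd(items[i], items[j]) for j in range(n)]
        for i in range(n)
    ]
```

A matrix produced this way could be passed to any distance-based clustering method. With a real compressor the values are only approximations: NCD of an object with itself is small but generally not exactly zero, and results depend on the compressor and on file sizes.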
Pages: 123-126
Page count: 4