Clustering XML Documents by Combining Content and Structure

Cited by: 9
Authors
Guo Yongming [1]
Chen Dehua [1]
Le Jiajin [1]
Affiliations
[1] Donghua Univ, Sch Comp Sci & Technol, Shanghai 201620, Peoples R China
Source
ISISE 2008: INTERNATIONAL SYMPOSIUM ON INFORMATION SCIENCE AND ENGINEERING, VOL 1 | 2008
Keywords
Clustering; XML; Extended Vector Space Model;
DOI
10.1109/ISISE.2008.301
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
XML has become a de facto standard for data representation and exchange over the Internet. As the number of XML documents grows, clustering them has become an active research area. XML documents lie between structured and unstructured data, describing both content and structure, so clustering them effectively is a major challenge. However, most existing clustering algorithms are based on structural similarity between XML documents and take their content into account little or not at all. In this paper, we develop a novel method for measuring similarity between XML documents that effectively combines their structure and content. Based on this similarity model, we adopt a hierarchical clustering algorithm to cluster XML documents. The experiments show that this method achieves better clustering quality.
Pages: 583-587
Page count: 5
Related papers
12 in total
[1]   A methodology for clustering XML documents by structure [J].
Dalamagas, T ;
Cheng, T ;
Winkel, KJ ;
Sellis, T .
INFORMATION SYSTEMS, 2006, 31 (03) :187-228
[2]  
Denoyer L., 2007, ACM SIGIR FORUM, V41, P79
[3]  
Larsen B., 1999, P 5 ACM SIGKDD INT C, P16, DOI 10.1145/312129.312186
[4]   Preparations for semantics-based XML mining [J].
Lee, JW ;
Lee, K ;
Kim, W .
2001 IEEE INTERNATIONAL CONFERENCE ON DATA MINING, PROCEEDINGS, 2001, :345-352
[5]  
LEUNG HP, P WIRI2005
[6]   An efficient and scalable algorithm for clustering XML documents by structure [J].
Lian, W ;
Cheung, DWL ;
Mamoulis, N ;
Yiu, SM .
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2004, 16 (01) :82-96
[7]  
MOH C, 2000, P 2 INT WORKSH ADV I
[8]   Fast and effective clustering of XML data using structural information [J].
Nayak, Richi .
KNOWLEDGE AND INFORMATION SYSTEMS, 2008, 14 (02) :197-215
[9]  
*SAX, SIMPL API XML
[10]  
SCHONAUER S, 2003, THESIS LUDWIGMAXIMIL