A progressive clustering algorithm to group the XML data by structural and semantic similarity

Cited by: 16
Authors
Nayak, Richi [1]
Tran, Tien [1]
Affiliations
[1] Queensland Univ Technol, Sch Informat Syst, Brisbane, Qld, Australia
Keywords
XML; clustering; structure; semantic; heterogeneous;
DOI
10.1142/S0218001407005648
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
As XML has grown in popularity for data representation and exchange over the Web, the number of XML documents has increased rapidly. Turning these documents into a more useful information resource has become a challenge for researchers. In this paper, we introduce a novel clustering algorithm, PCXSS, that groups heterogeneous XML documents according to their structural and semantic similarity. We develop a global criterion function, CPSim, that progressively measures the similarity between an XML document and the existing clusters, removing the need to compute the similarity between each pair of individual documents. The experimental analysis shows the method to be fast and accurate.
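The progressive idea described in the abstract, comparing each incoming document with the existing clusters rather than with every other document, can be illustrated with a minimal Python sketch. This is not PCXSS or CPSim: the record does not define either, so the sketch substitutes a plain Jaccard overlap of root-to-leaf tag paths as the similarity and ignores semantic matching of tag names entirely. The names `leaf_paths`, `jaccard`, `progressive_cluster`, and the `threshold` value are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def leaf_paths(xml_text):
    """Return the set of root-to-leaf tag paths in an XML document."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        children = list(node)
        if not children:
            paths.add(path)
        for child in children:
            walk(child, path)

    walk(root, "")
    return paths


def jaccard(a, b):
    """Stand-in similarity (an assumption, not CPSim): path-set overlap in [0, 1]."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def progressive_cluster(docs, threshold=0.5):
    """Assign each document to its most similar existing cluster in one pass.

    Each cluster is summarized by the union of its members' paths, so a new
    document is compared with clusters only, never with individual documents.
    """
    clusters = []  # each cluster: {"paths": set of paths, "members": list of indices}
    for i, text in enumerate(docs):
        paths = leaf_paths(text)
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = jaccard(paths, cluster["paths"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            # Similar enough: join the cluster and extend its summary.
            best["members"].append(i)
            best["paths"] |= paths
        else:
            # No sufficiently similar cluster: start a new one.
            clusters.append({"paths": set(paths), "members": [i]})
    return [cluster["members"] for cluster in clusters]
```

With n documents and k clusters, this style of assignment performs on the order of n·k comparisons instead of the n² needed for a full pairwise similarity matrix, which is the kind of saving the abstract attributes to the progressive approach.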
Pages: 723-743
Page count: 21