Fast detection of XML structural similarity

Cited by: 63
Authors
Flesca, S
Manco, G
Masciari, E
Pontieri, L
Pugliese, A
Affiliations
[1] CNR, ICAR, Inst High Performance Comp & Networks, I-87036 Arcavacata Di Rende, CS, Italy
[2] Univ Calabria, I-87036 Arcavacata Di Rende, CS, Italy
Keywords
Web mining; mining methods and algorithms; XML/XSL/RDF; text mining; similarity measures;
DOI
10.1109/TKDE.2005.27
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Because of the widespread diffusion of semistructured data in XML format, much research effort is currently devoted to support the storage and retrieval of large collections of such documents. XML documents can be compared as to their structural similarity, in order to group them into clusters so that different storage, retrieval, and processing techniques can be effectively exploited. In this scenario, an efficient and effective similarity function is the key of a successful data management process. We present an approach for detecting structural similarity between XML documents which significantly differs from standard methods based on graph-matching algorithms, and allows a significant reduction of the required computation costs. Our proposal roughly consists of linearizing the structure of each XML document, by representing it as a numerical sequence and, then, comparing such sequences through the analysis of their frequencies. First, some basic strategies for encoding a document are proposed, which can focus on diverse structural facets. Moreover, the theory of Discrete Fourier Transform is exploited to effectively and efficiently compare the encoded documents (i.e., signals) in the domain of frequencies. Experimental results reveal the effectiveness of the approach, also in comparison with standard methods.
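The pipeline outlined in the abstract (linearize each document's structure into a numerical sequence, then compare the sequences in the frequency domain) can be illustrated with a minimal sketch. The details here are assumptions chosen for illustration and are not taken from the paper: the encoding (`+id` for a start tag, `-id` for an end tag), the zero-padding to a common length, the Euclidean distance over DFT magnitudes, and the function names `encode_structure`, `dft_magnitudes`, and `structural_distance` are all hypothetical.

```python
import cmath
import math
import xml.etree.ElementTree as ET


def encode_structure(xml_text, tag_ids):
    """Linearize a document into a numeric sequence of tag events.

    Assumed encoding (not the paper's): each start tag emits +id and each
    end tag emits -id, where id is an integer assigned to the tag name the
    first time it is seen. Text content and attributes are ignored.
    """
    signal = []

    def visit(elem):
        tid = tag_ids.setdefault(elem.tag, len(tag_ids) + 1)
        signal.append(tid)       # start tag
        for child in elem:
            visit(child)
        signal.append(-tid)      # end tag

    visit(ET.fromstring(xml_text))
    return signal


def dft_magnitudes(signal, n):
    """Magnitudes of the n-point DFT of the signal, zero-padded to length n."""
    padded = list(signal) + [0] * (n - len(signal))
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(padded)))
        for k in range(n)
    ]


def structural_distance(xml_a, xml_b):
    """Euclidean distance between the DFT magnitude spectra of two documents."""
    tag_ids = {}  # shared so both documents use the same tag numbering
    a = encode_structure(xml_a, tag_ids)
    b = encode_structure(xml_b, tag_ids)
    n = max(len(a), len(b))
    fa, fb = dft_magnitudes(a, n), dft_magnitudes(b, n)
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(fa, fb)))


if __name__ == "__main__":
    doc1 = "<a><b>x</b><c/></a>"
    doc2 = "<a><b>y</b><c/></a>"   # same structure, different text
    doc3 = "<a><b/><b/><b/></a>"   # different structure
    print(structural_distance(doc1, doc2))
    print(structural_distance(doc1, doc3))
```

In this sketch, two documents that differ only in text content produce the same signal and therefore a distance of zero, while documents with different tag structure produce a positive distance. The direct O(n^2) DFT keeps the example dependency-free; a real implementation would normally use an FFT routine.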
Pages: 160-175
Number of pages: 16