Subtree Similarity Search Based on Structure and Text

被引:0
作者
Mizokami, Takuya [1 ]
Bou, Savong [2 ]
Amagasa, Toshiyuki [2 ]
机构
[1] Univ Tsukuba, Grad Sch Sci & Technol, Tsukuba, Ibaraki, Japan
[2] Univ Tsukuba, Ctr Computat Sci, Tsukuba, Ibaraki, Japan
来源
BIG DATA ANALYTICS AND KNOWLEDGE DISCOVERY, DAWAK 2024 | 2024年 / 14912卷
关键词
Approximate Matching; Similarity search; Tree edit distance; TREE EDIT DISTANCE; ALGORITHMS; EFFICIENT; FRAMEWORK; ROBUST;
D O I
10.1007/978-3-031-68323-7_6
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Given a query tree, the subtree similarity search problem is finding all subtrees in a document tree that are similar to the query tree. The previous scan-based method extracts candidate subtrees based on the size difference, which only considers the structural differences and ignores the differences in the contents represented by the trees. For this reason, it suffers from the following two issues. First, for queries against a tree with a regular structure, it is difficult to differentiate subtrees in terms of structural similarity, yielding a large number of candidate results to verify. Second, the candidates are verified by computing the tree edit distance, which is cubic to the number of tree nodes. In this paper, we propose a solution for the subtree similarity search problem based on the structure and contents of the trees. We demonstrate through experiments that our proposed method outperforms the previous scan-based methods in terms of speed and is competitive with index-based methods.
引用
收藏
页码:72 / 87
页数:16
相关论文
共 26 条
  • [1] Tree Edit Distance Problems: Algorithms and Applications to Bioinformatics
    Akutsu, Tatsuya
    [J]. IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2010, E93D (02): : 208 - 218
  • [2] TASM: Top-k Approximate Subtree Matching
    Augsten, Nikolaus
    Barbosa, Denilson
    Boehlen, Michael
    Palpanas, Themis
    [J]. 26TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING ICDE 2010, 2010, : 353 - 364
  • [3] A survey on tree edit distance and related problems
    Bille, P
    [J]. THEORETICAL COMPUTER SCIENCE, 2005, 337 (1-3) : 217 - 239
  • [4] Cohen Sara, 2013, P ACM SIGMOD INT C M, P49
  • [5] An Optimal Decomposition Algorithm for Tree Edit Distance
    Demaine, Erik D.
    Mozes, Shay
    Rossman, Benjamin
    Weimann, Oren
    [J]. ACM TRANSACTIONS ON ALGORITHMS, 2009, 6 (01)
  • [6] Optimal aggregation algorithms for middleware
    Fagin, R
    Lotem, A
    Naor, M
    [J]. JOURNAL OF COMPUTER AND SYSTEM SCIENCES, 2003, 66 (04) : 614 - 656
  • [7] Falleri J.-R., 2014, P 29 ACM IEEE INT C, P313, DOI DOI 10.1145/2642937.2642982
  • [8] Change distilling:: Tree differencing for fine-grained source code change extraction
    Fluri, Beat
    Wuersch, Michael
    Pinzger, Martin
    Gall, Harald C.
    [J]. IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, 2007, 33 (11) : 725 - 743
  • [9] Guha S., 2002, P ACM SIGMOD INT C M, P287
  • [10] Average complexity of the Jiang-Wang-Zhang pairwise tree alignment algorithm and of a RNA secondary structure alignment algorithm
    Herrbach, Claire
    Denise, Alain
    Dulucq, Serge
    [J]. THEORETICAL COMPUTER SCIENCE, 2010, 411 (26-28) : 2423 - 2432