A Trie Based Set Similarity Query Algorithm

被引:1
作者
Jia, Lianyin [1 ,2 ]
Tang, Junzhuo [1 ]
Li, Mengjuan [3 ]
Li, Runxin [1 ]
Ding, Jiaman [1 ]
Chen, Yinong [4 ]
机构
[1] Kunming Univ Sci & Technol, Fac Informat Engn & Automat, Kunming 650500, Peoples R China
[2] Kunming Univ Sci & Technol, Yunnan Key Lab Artificial Intelligence, Kunming 650500, Peoples R China
[3] Yunnan Normal Univ, Lib, Kunming 650500, Peoples R China
[4] Arizona State Univ, Sch Comp & Augmented Intelligence, Tempe, AZ 85287 USA
基金
中国国家自然科学基金;
关键词
set similarity query; T-starTrie; FMNodes; TT-SSQ; EFFICIENT;
D O I
10.3390/math11010229
中图分类号
O1 [数学];
学科分类号
0701 ; 070101 ;
摘要
Set similarity query is a primitive for many applications, such as data integration, data cleaning, and gene sequence alignment. Most of the existing algorithms are inverted index based, they usually filter unqualified sets one by one and do not have sufficient support for duplicated sets, thus leading to low efficiency. To solve this problem, this paper designs T-starTrie, an efficient trie based index for set similarity query, which can naturally group sets with the same prefix into one node, and can filter all sets corresponding to the node at a time, thereby significantly improving the candidates generation efficiency. In this paper, we find that the set similarity query problem can be transformed into matching nodes of the first-layer (FMNodes) detecting problem on T-starTrie. Therefore, an efficient FLMNode detection algorithm is designed. Based on this, an efficient set similarity query algorithm, TT-SSQ, is implemented by developing a variety of filtering techniques. Experimental results show that TT-SSQ can be up to 3.10x faster than existing algorithms.
引用
收藏
页数:13
相关论文
共 50 条
  • [21] XPath Query Optimization Based on Tree Automata
    Wang, Lan-ye
    Hong, Xiao-guang
    INTERNATIONAL CONFERENCE ON GRAPHIC AND IMAGE PROCESSING (ICGIP 2012), 2013, 8768
  • [22] On the correctness of a lock-free compression-based elastic mechanism for a hash trie design
    Areias, Miguel
    Rocha, Ricardo
    COMPUTING, 2022, 104 (10) : 2279 - 2305
  • [23] Evaluation of grinding surface roughness based on gradient similarity and color similarity
    Fang, Runji
    Yi, Huaian
    Shu, Aihua
    Lv, Xiao
    SURFACE TOPOGRAPHY-METROLOGY AND PROPERTIES, 2022, 10 (03)
  • [24] Novel itinerary-based KNN query algorithm leveraging grid division routing in wireless sensor networks of skewness distribution
    Han, Yibo
    Tang, Jine
    Zhou, ZhangBing
    Xiao, Mingzhong
    Sun, Limin
    Wang, Qun
    PERSONAL AND UBIQUITOUS COMPUTING, 2014, 18 (08) : 1989 - 2001
  • [25] A refined twig-join swift query algorithm for diversification issues of XML
    Kung, Yi-Wei
    Chang, Hsu-Kuang
    Lee, Chung-Nan
    JOURNAL OF INFORMATION SCIENCE, 2016, 42 (04) : 568 - 578
  • [26] String similarity join with different similarity thresholds based on novel indexing techniques
    Rong, Chuitian
    Silva, Yasin N.
    Li, Chunqing
    FRONTIERS OF COMPUTER SCIENCE, 2017, 11 (02) : 307 - 319
  • [27] Query the trajectory based on the precise track: a Bloom filter-based approach
    Wang, Zengjie
    Luo, Wen
    Yuan, Linwang
    Gao, Hong
    Wu, Fan
    Hu, Xu
    Yu, Zhaoyuan
    GEOINFORMATICA, 2021, 25 (02) : 397 - 416
  • [28] Similarity-based reasoning in conceptual spaces
    Douven, Igor
    Verheyen, Steven
    Elqayam, Shira
    Gardenfors, Peter
    Osta-Velez, Matias
    FRONTIERS IN PSYCHOLOGY, 2023, 14
  • [29] Subtree Similarity Search Based on Structure and Text
    Mizokami, Takuya
    Bou, Savong
    Amagasa, Toshiyuki
    BIG DATA ANALYTICS AND KNOWLEDGE DISCOVERY, DAWAK 2024, 2024, 14912 : 72 - 87
  • [30] FlashTrie: Hash-based Prefix-Compressed Trie for IP Route Lookup Beyond 100Gbps
    Bando, Masanori
    Chao, H. Jonathan
    2010 PROCEEDINGS IEEE INFOCOM, 2010,