String similarity join with different similarity thresholds based on novel indexing techniques

Cited by: 3
Authors
Rong, Chuitian [1]
Silva, Yasin N. [2]
Li, Chunqing [1]
Affiliations
[1] Tianjin Polytech Univ, Sch Comp Sci & Software Engn, Tianjin 300387, Peoples R China
[2] Arizona State Univ, Sch Math & Nat Sci, Tempe, AZ 85281 USA
Funding
National Natural Science Foundation of China
Keywords
similarity join; similarity aware index; similarity thresholds; EFFICIENT
DOI
10.1007/s11704-016-5231-1
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
String similarity join is an essential operation in many applications that need to find all similar string pairs from two given collections. A quantitative way to determine whether two strings are similar is to compute their similarity with a chosen similarity function. The string pairs whose similarity is above a given threshold are returned as results. Current approaches to the similarity join problem use a single threshold value. There are, however, several scenarios that require support for multiple thresholds, for instance, when the dataset includes strings of various lengths. In this scenario, longer string pairs can typically tolerate many more typos than shorter ones. Therefore, we propose a solution for string similarity joins that supports different similarity thresholds in a single operator. To support different thresholds, we devise two novel indexing techniques: partition-based indexing and similarity-aware indexing. To utilize the new indices and improve join performance, we propose new filtering methods and index probing techniques. To the best of our knowledge, this is the first work that addresses this problem. Experimental results on real-world datasets show that our solution performs efficiently while providing a more flexible threshold specification.
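The idea of a join with length-dependent thresholds can be illustrated with a minimal sketch. This is a naive nested-loop baseline, not the partition-based or similarity-aware indexing described in the abstract. It assumes edit distance as the measure (so a pair qualifies when its distance is at or below the threshold, the distance-based counterpart of "similarity above a threshold"), and the function names and the threshold policy in `threshold_for` are hypothetical, chosen only for illustration.

```python
from typing import Callable, List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def threshold_for(length: int) -> int:
    """Hypothetical policy: longer strings may contain more typos."""
    if length <= 5:
        return 1
    if length <= 15:
        return 2
    return 3


def similarity_join(
    r: List[str],
    s: List[str],
    tau: Callable[[int], int] = threshold_for,
) -> List[Tuple[str, str]]:
    """Naive join: compare every pair under a length-dependent threshold."""
    results = []
    for x in r:
        for y in s:
            # Assumption: the threshold is picked from the longer string.
            limit = tau(max(len(x), len(y)))
            # Length filter: edit distance is at least the length difference.
            if abs(len(x) - len(y)) > limit:
                continue
            if edit_distance(x, y) <= limit:
                results.append((x, y))
    return results


if __name__ == "__main__":
    left = ["cat", "kitten", "database systems"]
    right = ["cut", "mitten!", "databse sytems"]
    for pair in similarity_join(left, right):
        print(pair)
```

Passing a constant function such as `lambda n: 1` as `tau` reduces the sketch to the single-threshold join that the abstract contrasts against; under that setting the pair ("kitten", "mitten!") is no longer returned.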
Pages: 307-319
Number of pages: 13