String similarity join with different similarity thresholds based on novel indexing techniques

被引:3
|
作者
Rong, Chuitian [1 ]
Silva, Yasin N. [2 ]
Li, Chunqing [1 ]
机构
[1] Tianjin Polytech Univ, Sch Comp Sci & Software Engn, Tianjin 300387, Peoples R China
[2] Arizona State Univ, Sch Math & Nat Sci, Tempe, AZ 85281 USA
基金
中国国家自然科学基金;
关键词
similarity join; similarity aware index; similarity thresholds; EFFICIENT;
D O I
10.1007/s11704-016-5231-1
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
String similarity join is an essential operation of many applications that need to find all similar string pairs from two given collections. A quantitative way to determine whether two strings are similar is to compute their similarity based on a certain similarity function. The string pairs with similarity above a certain threshold are regarded as results. The current approach to solving the similarity join problem is to use a unique threshold value. There are, however, several scenarios that require the support of multiple thresholds, for instance, when the dataset includes strings of various lengths. In this scenario, longer string pairs typically tolerate much more typos than shorter ones. Therefore, we proposed a solution for string similarity joins that supports different similarity thresholds in a single operator. In order to support different thresholds, we devised two novel indexing techniques: partition based indexing and similarity aware indexing. To utilize the new indices and improve the join performance, we proposed new filtering methods and index probing techniques. To the best of our knowledge, this is the first work that addresses this problem. Experimental results on real-world datasets show that our solution performs efficiently while providing a more flexible threshold specification.
引用
收藏
页码:307 / 319
页数:13
相关论文
共 30 条
  • [1] String similarity join with different similarity thresholds based on novel indexing techniques
    Chuitian Rong
    Yasin N. Silva
    Chunqing Li
    Frontiers of Computer Science, 2017, 11 : 307 - 319
  • [2] String similarity search and join: a survey
    Minghe Yu
    Guoliang Li
    Dong Deng
    Jianhua Feng
    Frontiers of Computer Science, 2016, 10 : 399 - 417
  • [3] String similarity search and join: a survey
    Yu, Minghe
    Li, Guoliang
    Deng, Dong
    Feng, Jianhua
    FRONTIERS OF COMPUTER SCIENCE, 2016, 10 (03) : 399 - 417
  • [4] EFFICIENT STRING EDIT SIMILARITY JOIN ALGORITHM
    Gouda, Karam
    Rashad, Metwally
    COMPUTING AND INFORMATICS, 2017, 36 (03) : 683 - 704
  • [5] Efficient and Scalable Processing of String Similarity Join
    Rong, Chuitian
    Lu, Wei
    Wang, Xiaoli
    Du, Xiaoyong
    Chen, Yueguo
    Tung, Anthony K. H.
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2013, 25 (10) : 2217 - 2230
  • [6] Parallel String Similarity Join Approach Based on CPU-GPU Heterogeneous Architecture
    Xu K.
    Nie T.
    Shen D.
    Kou Y.
    Yu G.
    Jisuanji Yanjiu yu Fazhan/Computer Research and Development, 2021, 58 (03): : 598 - 608
  • [7] Handling data-skewness in character based string similarity join using Hadoop
    Meena, Kanak
    Tayal, Devendra K.
    Castillo, Oscar
    Jain, Amita
    APPLIED COMPUTING AND INFORMATICS, 2022, 18 (1/2) : 22 - 44
  • [8] Leveraging deletion neighborhoods and trie for efficient string similarity search and join
    Cui, Jia
    Meng, Dan
    Chen, Zhong-Tao
    Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2014, 8870
  • [9] Leveraging Deletion Neighborhoods and Trie for Efficient String Similarity Search and Join
    Cui, Jia
    Meng, Dan
    Chen, Zhong-Tao
    INFORMATION RETRIEVAL TECHNOLOGY, AIRS 2014, 2014, 8870 : 1 - 13
  • [10] Signature-Based Trajectory Similarity Join
    Ta, Na
    Li, Guoliang
    Xie, Yongqing
    Li, Changqi
    Hao, Shuang
    Feng, Jianhua
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2017, 29 (04) : 870 - 883