A Generic Method for Accelerating LSH-Based Similarity Join Processing

被引:15
|
作者
Yu, Chenyun [1 ]
Nutanong, Sarana [1 ]
Li, Hangyu [1 ]
Wang, Cong [1 ]
Yuan, Xingliang [1 ]
机构
[1] City Univ Hong Kong, Dept Comp Sci, Kowloon, Hong Kong, Peoples R China
关键词
Similarity join; locality sensitive hashing; query processing; representative selection;
D O I
10.1109/TKDE.2016.2638838
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Locality sensitive hashing (LSH) is an efficient method for solving the problem of approximate similarity search in high-dimensional spaces. Through LSH, a high-dimensional similarity join can be processed in the same way as hash join, making the cost of joining two large datasets linear. By judicially analyzing the properties of multiple LSH algorithms, we propose a generic method to speed up the process of joining two large datasets using LSH. The crux of our method lies in the way which we identify a set of representative points to reduce the number of LSH lookups. Theoretical analyzes show that our proposed method can greatly reduce the number of lookup operations and retain the same result accuracy compared to executing LSH lookups for every query point. Furthermore, we demonstrate the generality of our method by showing that the same principle can be applied to LSH algorithms for three different metrics: the Euclidean distance (QALSH), Jaccard similarity measure (MinHash), and Hamming distance (sequence hashing). Results from experimental studies using real datasets confirm our error analyzes and show significant improvements of our method over the state-of-the-art LSH method: to achieve over 0.95 recall, we only need to operate LSH lookups for at most 15 percent of the query points.
引用
收藏
页码:712 / 726
页数:15
相关论文
共 49 条
  • [31] PAX-based Join Processing for Flash Memory
    Lee, Jae-Uk
    Chung, Tae-Sun
    2011 INTERNATIONAL CONFERENCE ON COMPUTERS, COMMUNICATIONS, CONTROL AND AUTOMATION (CCCA 2011), VOL I, 2010, : 450 - 453
  • [32] Efficient top-k similarity join processing over multi-valued objects
    Wenjie Zhang
    Liming Zhan
    Ying Zhang
    Muhammad Aamir Cheema
    Xuemin Lin
    World Wide Web, 2014, 17 : 285 - 309
  • [33] How improve Set Similarity Join based on prefix approach in distributed environment
    Zhu, Song
    Gagliardelli, Luca
    Simonini, Giovanni
    Beneventano, Domenico
    PROCEEDINGS 2018 INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING & SIMULATION (HPCS), 2018, : 844 - 851
  • [34] Distributed Entity Resolution Based on Similarity Join for Large-Scale Data Clustering
    Nie, Tiezheng
    Lee, Wang-chien
    Shen, Derong
    Yu, Ge
    Kou, Yue
    WEB-AGE INFORMATION MANAGEMENT, WAIM 2014, 2014, 8485 : 138 - 149
  • [35] Parallel String Similarity Join Approach Based on CPU-GPU Heterogeneous Architecture
    Xu K.
    Nie T.
    Shen D.
    Kou Y.
    Yu G.
    Jisuanji Yanjiu yu Fazhan/Computer Research and Development, 2021, 58 (03): : 598 - 608
  • [36] Handling data-skewness in character based string similarity join using Hadoop
    Meena, Kanak
    Tayal, Devendra K.
    Castillo, Oscar
    Jain, Amita
    APPLIED COMPUTING AND INFORMATICS, 2022, 18 (1/2) : 22 - 44
  • [37] Distributed Similarity Join Over Data Streams Based on Earth Mover's Distance
    Xu J.
    Song C.
    Lv P.
    Li T.-S.
    Jisuanji Xuebao/Chinese Journal of Computers, 2019, 42 (08): : 1779 - 1796
  • [38] Parallel set similarity join on big data based on Locality-Sensitive Hashing
    Sohrabi, Mohammad Karim
    Azgomi, Hosseion
    SCIENCE OF COMPUTER PROGRAMMING, 2017, 145 : 1 - 12
  • [39] Optimizing Nonindexed Join Processing in Flash Storage-Based Systems
    Li, Yu
    On, Sai Tung
    Xu, Jianliang
    Choi, Byron
    Hu, Haibo
    IEEE TRANSACTIONS ON COMPUTERS, 2013, 62 (07) : 1417 - 1431
  • [40] Projection Based Large Scale High-Dimensional Data Similarity Join Using MapReduce Framework
    Ma, Youzhong
    Zhang, Ruiling
    Cui, Zhanyou
    Lin, Chunjie
    IEEE ACCESS, 2020, 8 : 121665 - 121677