A Generic Method for Accelerating LSH-Based Similarity Join Processing

被引:15
|
作者
Yu, Chenyun [1 ]
Nutanong, Sarana [1 ]
Li, Hangyu [1 ]
Wang, Cong [1 ]
Yuan, Xingliang [1 ]
机构
[1] City Univ Hong Kong, Dept Comp Sci, Kowloon, Hong Kong, Peoples R China
关键词
Similarity join; locality sensitive hashing; query processing; representative selection;
D O I
10.1109/TKDE.2016.2638838
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Locality sensitive hashing (LSH) is an efficient method for solving the problem of approximate similarity search in high-dimensional spaces. Through LSH, a high-dimensional similarity join can be processed in the same way as hash join, making the cost of joining two large datasets linear. By judicially analyzing the properties of multiple LSH algorithms, we propose a generic method to speed up the process of joining two large datasets using LSH. The crux of our method lies in the way which we identify a set of representative points to reduce the number of LSH lookups. Theoretical analyzes show that our proposed method can greatly reduce the number of lookup operations and retain the same result accuracy compared to executing LSH lookups for every query point. Furthermore, we demonstrate the generality of our method by showing that the same principle can be applied to LSH algorithms for three different metrics: the Euclidean distance (QALSH), Jaccard similarity measure (MinHash), and Hamming distance (sequence hashing). Results from experimental studies using real datasets confirm our error analyzes and show significant improvements of our method over the state-of-the-art LSH method: to achieve over 0.95 recall, we only need to operate LSH lookups for at most 15 percent of the query points.
引用
收藏
页码:712 / 726
页数:15
相关论文
共 49 条
  • [11] LSH-based semantic dictionary learning for large scale image understanding
    Li, Liang
    Yan, Chenggang Clarence
    Ji, Wen
    Chen, Bo-Wei
    Jiang, Shuqiang
    Huang, Qingming
    JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2015, 31 : 231 - 236
  • [12] Satellite Resource Allocation via Dynamic Auctions and LSH-based Predictions
    Cheng, Lin
    Huberman, Bernardo A.
    2023 IEEE 97TH VEHICULAR TECHNOLOGY CONFERENCE, VTC2023-SPRING, 2023,
  • [13] A Distributed Near-Optimal LSH-based Framework for Privacy-Preserving Record Linkage
    Karapiperis, Dimitrios
    Verykios, Vassilios S.
    COMPUTER SCIENCE AND INFORMATION SYSTEMS, 2014, 11 (02) : 745 - 763
  • [14] Efficient and Scalable Processing of String Similarity Join
    Rong, Chuitian
    Lu, Wei
    Wang, Xiaoli
    Du, Xiaoyong
    Chen, Yueguo
    Tung, Anthony K. H.
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2013, 25 (10) : 2217 - 2230
  • [15] An LSH-Based Blocking Approach with a Homomorphic Matching Technique for Privacy-Preserving Record Linkage
    Karapiperis, Dimitrios
    Verykios, Vassilios S.
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2015, 27 (04) : 909 - 921
  • [16] C2Net: A Network-Efficient Approach to Collision Counting LSH Similarity Join
    Li, Hangyu
    Nutanong, Sarana
    Xu, Hong
    Yu, Chenyun
    Ha, Foryu
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2019, 31 (03) : 423 - 436
  • [17] A Prefix-Filter based Method for Spatio-Textual Similarity Join
    Liu, Sitong
    Li, Guoliang
    Feng, Jianhua
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2014, 26 (10) : 2354 - 2367
  • [18] Semi-Stream Similarity Join Processing in a Distributed Environment
    Kim, Hong-Ji
    Lee, Ki-Hoon
    IEEE ACCESS, 2020, 8 : 130194 - 130204
  • [19] YOLO and LSH-based video stream analytics landscape for short-term trafffiic density surveillance at road networks
    Lavanya, K.
    Tiwari, Stuti
    Anand, Rahul
    Hemanth, Jude
    TURKISH JOURNAL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCES, 2023, 31 (06) : 1099 - 1112
  • [20] PPIS-JOIN: A Novel Privacy-Preserving Image Similarity Join Method
    Zhang, Chengyuan
    Xie, Fangxin
    Yu, Hao
    Zhang, Jianfeng
    Zhu, Lei
    Li, Yangding
    NEURAL PROCESSING LETTERS, 2022, 54 (04) : 2783 - 2801