A Generic Method for Accelerating LSH-Based Similarity Join Processing

被引:15
|
作者
Yu, Chenyun [1 ]
Nutanong, Sarana [1 ]
Li, Hangyu [1 ]
Wang, Cong [1 ]
Yuan, Xingliang [1 ]
机构
[1] City Univ Hong Kong, Dept Comp Sci, Kowloon, Hong Kong, Peoples R China
关键词
Similarity join; locality sensitive hashing; query processing; representative selection;
D O I
10.1109/TKDE.2016.2638838
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Locality sensitive hashing (LSH) is an efficient method for solving the problem of approximate similarity search in high-dimensional spaces. Through LSH, a high-dimensional similarity join can be processed in the same way as hash join, making the cost of joining two large datasets linear. By judicially analyzing the properties of multiple LSH algorithms, we propose a generic method to speed up the process of joining two large datasets using LSH. The crux of our method lies in the way which we identify a set of representative points to reduce the number of LSH lookups. Theoretical analyzes show that our proposed method can greatly reduce the number of lookup operations and retain the same result accuracy compared to executing LSH lookups for every query point. Furthermore, we demonstrate the generality of our method by showing that the same principle can be applied to LSH algorithms for three different metrics: the Euclidean distance (QALSH), Jaccard similarity measure (MinHash), and Hamming distance (sequence hashing). Results from experimental studies using real datasets confirm our error analyzes and show significant improvements of our method over the state-of-the-art LSH method: to achieve over 0.95 recall, we only need to operate LSH lookups for at most 15 percent of the query points.
引用
收藏
页码:712 / 726
页数:15
相关论文
共 49 条
  • [21] Signature-Based Trajectory Similarity Join
    Ta, Na
    Li, Guoliang
    Xie, Yongqing
    Li, Changqi
    Hao, Shuang
    Feng, Jianhua
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2017, 29 (04) : 870 - 883
  • [22] PPIS-JOIN: A Novel Privacy-Preserving Image Similarity Join Method
    Chengyuan Zhang
    Fangxin Xie
    Hao Yu
    Jianfeng Zhang
    Lei Zhu
    Yangding Li
    Neural Processing Letters, 2022, 54 : 2783 - 2801
  • [23] A Density-Aware Similarity Join Query Processing Algorithm on MapReduce
    Jang, Miyoung
    Song, Youngho
    Chang, Jae-Woo
    ADVANCED MULTIMEDIA AND UBIQUITOUS ENGINEERING: FUTURETECH & MUE, 2016, 393 : 469 - 475
  • [24] String similarity join with different similarity thresholds based on novel indexing techniques
    Chuitian Rong
    Yasin N. Silva
    Chunqing Li
    Frontiers of Computer Science, 2017, 11 : 307 - 319
  • [25] String similarity join with different similarity thresholds based on novel indexing techniques
    Rong, Chuitian
    Silva, Yasin N.
    Li, Chunqing
    FRONTIERS OF COMPUTER SCIENCE, 2017, 11 (02) : 307 - 319
  • [26] Optimizing join index based join processing: A graph partitioning approach
    Ravada, S
    Shekhar, S
    Lu, CT
    Chawla, S
    SEVENTEENTH IEEE SYMPOSIUM ON RELIABLE DISTRIBUTED SYSTEMS, PROCEEDINGS, 1998, : 302 - 308
  • [27] Fuzzy Similarity Join Algorithm Based on Dynamic Double Prefixes
    Yu C.-Y.
    Wang W.-H.
    Wen X.-J.
    Zhao Y.-H.
    Dongbei Daxue Xuebao/Journal of Northeastern University, 2022, 43 (03): : 321 - 327
  • [28] Refining High-frequency-queries-based Filter for Similarity Join
    Chongstitvatana, Jaruloj
    Thitinanrungkit, Natthee
    2015 INTERNATIONAL COMPUTER SCIENCE AND ENGINEERING CONFERENCE (ICSEC), 2015, : 68 - 72
  • [29] PAX-BASED JOIN PROCESSING FOR FLASH MEMORY
    Lee, Jae-Uk
    Chung, Tae-Sun
    2011 INTERNATIONAL CONFERENCE ON MECHANICAL ENGINEERING AND TECHNOLOGY (ICMET 2011), 2011, : 737 - 740
  • [30] Efficient top-k similarity join processing over multi-valued objects
    Zhang, Wenjie
    Zhan, Liming
    Zhang, Ying
    Cheema, Muhammad Aamir
    Lin, Xuemin
    WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS, 2014, 17 (03): : 285 - 309