A Generic Method for Accelerating LSH-Based Similarity Join Processing

被引:15
|
作者
Yu, Chenyun [1 ]
Nutanong, Sarana [1 ]
Li, Hangyu [1 ]
Wang, Cong [1 ]
Yuan, Xingliang [1 ]
机构
[1] City Univ Hong Kong, Dept Comp Sci, Kowloon, Hong Kong, Peoples R China
关键词
Similarity join; locality sensitive hashing; query processing; representative selection;
D O I
10.1109/TKDE.2016.2638838
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Locality sensitive hashing (LSH) is an efficient method for solving the problem of approximate similarity search in high-dimensional spaces. Through LSH, a high-dimensional similarity join can be processed in the same way as hash join, making the cost of joining two large datasets linear. By judicially analyzing the properties of multiple LSH algorithms, we propose a generic method to speed up the process of joining two large datasets using LSH. The crux of our method lies in the way which we identify a set of representative points to reduce the number of LSH lookups. Theoretical analyzes show that our proposed method can greatly reduce the number of lookup operations and retain the same result accuracy compared to executing LSH lookups for every query point. Furthermore, we demonstrate the generality of our method by showing that the same principle can be applied to LSH algorithms for three different metrics: the Euclidean distance (QALSH), Jaccard similarity measure (MinHash), and Hamming distance (sequence hashing). Results from experimental studies using real datasets confirm our error analyzes and show significant improvements of our method over the state-of-the-art LSH method: to achieve over 0.95 recall, we only need to operate LSH lookups for at most 15 percent of the query points.
引用
收藏
页码:712 / 726
页数:15
相关论文
共 49 条
  • [41] Exploiting spatial indexes for semijoin-based join processing in distributed spatial databases
    Tan, KL
    Ooi, BC
    Abel, DJ
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2000, 12 (06) : 920 - 937
  • [42] Efficient index-based KNN join processing for high-dimensional data
    Yu, Cui
    Cui, Bin
    Wang, Shuguang
    Su, Jianwen
    INFORMATION AND SOFTWARE TECHNOLOGY, 2007, 49 (04) : 332 - 344
  • [43] Finding a Set of High-frequency Queries for High-frequency-query-based Filter for Similarity Join
    Kunanusont, Kamolwan
    Chongstitvatana, Jaruloj
    2015 12TH INTERNATIONAL CONFERENCE ON ELECTRICAL ENGINEERING/ELECTRONICS, COMPUTER, TELECOMMUNICATIONS AND INFORMATION TECHNOLOGY (ECTI-CON), 2015,
  • [44] Adaptive LSH based on the particle swarm method with the attractor selection model for fast approximation of Gaussian process regression
    Okadome, Yuya
    Urai, Kenji
    Nakamura, Yutaka
    Yomo, Tetsuya
    Ishiguro, Hiroshi
    ARTIFICIAL LIFE AND ROBOTICS, 2014, 19 (03) : 220 - 226
  • [45] A Method of Twig Query Processing Based on XML Schema
    Li, Suming
    2017 2ND AASRI INTERNATIONAL CONFERENCE ON INDUSTRIAL ELECTRONICS AND APPLICATIONS (IEA 2017), 2017, : 72 - 75
  • [46] An Online Approximate Aggregation Query Processing Method Based on Hadoop
    Zhang, Zhiqiang
    Hu, Jianghua
    Xie, Xiaoqin
    Pan, Haiwei
    Feng, Xiaoning
    2016 IEEE 20TH INTERNATIONAL CONFERENCE ON COMPUTER SUPPORTED COOPERATIVE WORK IN DESIGN (CSCWD), 2016, : 117 - 122
  • [47] PATRIoTA: A Similarity-based IoT Malware Detection Method Robust Against Adversarial Samples
    Sandor, Jozsef
    Nagy, Roland
    Buttyan, Levente
    2023 IEEE INTERNATIONAL CONFERENCE ON EDGE COMPUTING AND COMMUNICATIONS, EDGE, 2023, : 344 - 353
  • [48] A Method of XML Twig Query Processing based on XML Document Schema
    Yu, Yi
    PROCEEDINGS OF THE 2017 INTERNATIONAL CONFERENCE ON MECHANICAL, ELECTRONIC, CONTROL AND AUTOMATION ENGINEERING (MECAE 2017), 2017, 61 : 172 - 175
  • [49] An efficient query processing method for location-based application on mobile computing
    Kim, SH
    Chung, W
    Bae, HY
    6TH WORLD MULTICONFERENCE ON SYSTEMICS, CYBERNETICS AND INFORMATICS, VOL XV, PROCEEDINGS: MOBILE/WIRELESS COMPUTING AND COMMUNICATION SYSTEMS III, 2002, : 101 - 106