Locality Sensitive Hashing Index Based on Optimal Linear Order

被引:0
|
作者
Feng X.-K. [1 ]
Peng Y.-G. [1 ]
Cui J.-T. [1 ]
Liu Y.-F. [2 ]
Li H. [3 ,4 ]
机构
[1] School of Computer Science and Technology, Xidian University, Xi'an
[2] Department of System Engineering and Engineering Management, Chinese University of Hong Kong, Hong Kong
[3] School of Cyber Engineering, Xidian University, Xi'an
[4] National Engineering Laboratory for Public Security Risk Perception and Control by Big Data, Beijing
来源
Jisuanji Xuebao/Chinese Journal of Computers | 2020年 / 43卷 / 05期
基金
中国国家自然科学基金; 中国博士后科学基金;
关键词
Approximate nearest neighbor; High-dimensional index; Linear order; Local distribution; Locality-sensitive hashing;
D O I
10.11897/SP.J.1016.2020.00930
中图分类号
学科分类号
摘要
Nearest neighbor (NN) search in high-dimensional space is a fundamental paradigm in a wide range of applications, such as text information retrieval, search engine, content-based information query, duplication detection, etc. In these areas, data are usually large-scale and are modeled as high-dimensional features, which introduce two major problems to NN search, the "curse of dimensionality" and I/O performance bottleneck. On the one hand, the dimensionality curse makes exact NNs nearly infeasible to achieve. A lot of research efforts have been devoted to finding approximate nearest neighbors (ANN) which are close enough to the query to achieve a satisfying trade-off between accuracy and efficiency. On the other hand, large-scale feature sets are often too massive to fit into the internal memory, external storage (usually disk) becomes a reasonable choice. However, due to the huge speed gap between internal memory and external memory, the resulting input/output communication becomes very expensive, too much NN candidates or improper loading manner would make the NN candidates loading the most time-consuming part of the entire NN search, called the I/O performance bottleneck. LSH is a widely adopted technique due to its excellent error guarantee and high computational efficiency in tackling the "curse of dimensionality". It can enable fast and accurate irrelevant points filtration and offers c-ANN results at a sub-linear time complexity which also makes it an attractive approach for disk-based ANN search. Plenty of LSH based methods have been developed to further boost the ANN performance. However, most of them access candidate objects through significant random I/O operations, which makes them tend to incur the I/O performance bottleneck. In this work, a novel Locality-Sensitive Hashing (LSH) index called Optimal Order LSH (O2LSH) is designed to further address the above two problems. First, O2LSH introduces space-filling curves to sort the compound LSH keys and rearrange the original data points accordingly. In this way, NN candidates can be stored on same or adjacent disk pages so that only a few sequential I/O operations can load enough candidates during the search. A thorough quantitative analysis is then conducted on several common space-filling curves. The results show that there exists two different kinds of characteristic among these space-filling curves, (1) curves of "dimension-first-traverse" characteristic (such as the row-wise curve) tend to introduce several limitations to ANN search; (2) curves of "neighborhood-first-traverse" characteristic (such as Z-order, Gray curve, Hilbert curve, etc.) can produce better local distribution of NN candidate points, and the performance are more stable. Based on the analysis, O2LSH chooses a best "neighborhood-first-traverse" curve as the linear order to maintain as many as NN candidates within same local disk pages. In this way, O2LSH can not only enhance the I/O efficiency but also improve the ANN search accuracy. We conduct empirical experiments on 6 real-world multimedia data sets. The results demonstrate the superior accuracy and I/O efficiency of O2LSH in ANN search, compared with 4 state-of-the-art methods, including C2LSH, SK-LSH, SRS and QALSH. Particularly, O2LSH is no longer sensitive to a key parameter of LSH function as the basic sorting-based solution does, which further improves the algorithm practicality. © 2020, Science Press. All right reserved.
引用
收藏
页码:930 / 947
页数:17
相关论文
共 24 条
  • [1] Yuan Pei-Sen, Sha Chao-Feng, Wang Xiao-Ling, Zhou Ao-Ying, c-Approximate nearest neighbor query algorithm based on learning for high-dimensional data, Journal of Software, 23, 8, pp. 2018-2031, (2012)
  • [2] Vitter J S., Algorithms and data structures for external memory, Foundations and Trends in Theoretical Computer Science, 2, 4, pp. 305-474, (2008)
  • [3] Indyk P, Motwani R., Approximate nearest neighbors:Towards removing the curse of dimensionality, Proceedings of the 30th Annual ACM Symposium on Theory of Computing, pp. 604-613, (1998)
  • [4] Datar M, Immorlica N, Indyk P, Et al., Locality-sensitive hashing scheme based on p-stable distributions, Proceedings of the 20th Annual Symposium on Computational Geometry, pp. 253-262, (2004)
  • [5] Gionis A, Indyk P, Motwani R., Similarity search in high dimensions via hashing, Proceedings of the 25th International Conference on Very Large Data Bases, 99, 6, pp. 518-529, (1999)
  • [6] Jegou H, Douze M, Schmid C., Product quantization for nearest neighbor search, IEEE Transactions on Pattern Analysis and Machine Intelligence, 33, 1, pp. 117-128, (2011)
  • [7] Wang Jianfeng, Wang Jingdong, Song Jingkuan, Et al., Optimized Cartesian k means, IEEE Transactions on Knowledge and Data Engineering, 27, 1, pp. 180-192, (2015)
  • [8] Hajebi K, Abbasi-Yadkori Y, Shahbazi H, Et al., Fast approximate nearest-neighbor search with k-nearest neighbor graph, Proceedings of the International Joint Conference on Artificial Intelligence, 22, 1, (2011)
  • [9] Malkov Y, Ponomarenko A, Logvinov A, Et al., Approximate nearest neighbor algorithm based on navigable small world graphs, Information Systems, 45, pp. 61-68, (2014)
  • [10] Dong W, Moses C, Li K., Efficient k-nearest neighbor graph construction for generic similarity measures, Proceedings of the 20th International Conference on World Wide Web, pp. 577-586, (2011)