Learning-based query optimization for multi-probe approximate nearest neighbor search

Cited by: 0
Authors
Pengcheng Zhang
Bin Yao
Chao Gao
Bin Wu
Xiao He
Feifei Li
Yuanfei Lu
Chaoqun Zhan
Feilong Tang
Institutions
[1] Shanghai Jiao Tong University
[2] Alibaba Cloud
[3] Hangzhou Institute of Advanced Technology
Source
The VLDB Journal | 2023 / Volume 32
Keywords
Approximate nearest neighbor search; Query optimization; Machine learning;
DOI
Not available
Abstract
Approximate nearest neighbor search (ANNS) is a fundamental problem that has attracted widespread attention for decades. Multi-probe ANNS is one of the most important classes of ANNS methods, playing crucial roles in disk-based, GPU-based, and distributed scenarios. State-of-the-art multi-probe ANNS approaches typically run with a fixed configuration: for example, each query is dispatched to a fixed number of partitions where ANNS algorithms run locally, and the results are merged to obtain the final result set. Our observations show that such fixed configurations typically lead to a non-optimal accuracy–efficiency trade-off. To further optimize multi-probe ANNS, we propose generating an efficient configuration for each query individually. By formalizing the per-query optimization as a 0–1 knapsack problem and its variants, we identify that the kNN distribution (the proportion of a query's k nearest neighbors placed in each partition) is essential to the optimization. We then develop LEQAT (LEarned Query-Aware OpTimizer), which leverages the kNN distribution to seek an optimal configuration for each query. LEQAT comes with (i) a machine learning model that learns and estimates kNN distributions from historical or sample queries and (ii) efficient query optimization algorithms that determine which partitions to probe and how many neighbors to search in each partition. We apply LEQAT to three state-of-the-art ANNS methods, IVF, HNSW, and SSG, under clustering-based partitioning and evaluate the overall performance on several real-world datasets. The results show that LEQAT consistently reduces latency by up to 58% and improves throughput by up to 3.9 times.
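As a concrete illustration of the 0–1 knapsack framing described in the abstract, the sketch below selects which partitions to probe for a single query so that the expected share of its k nearest neighbors is maximized under a total probe-cost budget. It is a minimal, hypothetical example: the names knn_share, probe_cost, and budget, the integer cost model, and the textbook knapsack dynamic program are assumptions for illustration, not LEQAT's actual estimator or optimization algorithm.

```python
# Illustrative sketch (not LEQAT's actual algorithm): per-query probe selection
# framed as a 0-1 knapsack. Assumes an already-estimated kNN distribution
# (knn_share[i] = estimated fraction of the query's k nearest neighbors that
# fall in partition i) and an integer probe cost per partition.

from typing import List, Tuple


def select_partitions(knn_share: List[float],
                      probe_cost: List[int],
                      budget: int) -> Tuple[float, List[int]]:
    """Maximize expected recall (sum of kNN shares of the probed partitions)
    subject to a total probe-cost budget, via 0-1 knapsack DP."""
    n = len(knn_share)
    # dp[c] = best expected recall achievable with total cost at most c
    dp = [0.0] * (budget + 1)
    choice = [[False] * n for _ in range(budget + 1)]

    for i in range(n):
        # Iterate capacities in descending order so each partition is used once.
        for c in range(budget, probe_cost[i] - 1, -1):
            cand = dp[c - probe_cost[i]] + knn_share[i]
            if cand > dp[c]:
                dp[c] = cand
                choice[c] = choice[c - probe_cost[i]].copy()
                choice[c][i] = True

    best_c = max(range(budget + 1), key=lambda c: dp[c])
    picked = [i for i in range(n) if choice[best_c][i]]
    return dp[best_c], picked


# Example: 4 partitions with estimated kNN shares and probe costs for one query.
recall, probes = select_partitions(
    knn_share=[0.55, 0.25, 0.15, 0.05],
    probe_cost=[3, 2, 2, 1],
    budget=5,
)
print(f"expected recall ~{recall:.2f}, probe partitions {probes}")
```

Under these toy numbers the optimizer probes partitions 0 and 1 for an expected recall of about 0.80, skipping the partitions whose estimated kNN share does not justify their probe cost.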
Pages: 623-645
Number of pages: 22