Learning-based query optimization for multi-probe approximate nearest neighbor search

被引:0
作者
Zhang, Pengcheng [1 ]
Yao, Bin [1 ,3 ]
Gao, Chao [1 ]
Wu, Bin [2 ]
He, Xiao [2 ]
Li, Feifei [2 ]
Lu, Yuanfei [2 ]
Zhan, Chaoqun [2 ]
Tang, Feilong [1 ]
机构
[1] Shanghai Jiao Tong Univ, Shanghai, Peoples R China
[2] Alibaba Cloud, Hangzhou, Peoples R China
[3] Hangzhou Inst Adv Technol, Hangzhou, Peoples R China
关键词
Approximate nearest neighbor search; Query optimization; Machine learning; PRODUCT QUANTIZATION; ENGINE;
D O I
10.1007/s00778-022-00762-0
中图分类号
TP3 [计算技术、计算机技术];
学科分类号
0812 ;
摘要
Approximate nearest neighbor search (ANNS) is a fundamental problem that has attracted widespread attention for decades. Multi-probe ANNS is one of the most important classes of ANNS methods, playing crucial roles in disk-based, GPU-based, and distributed scenarios. The state-of-the-art multi-probe ANNS approaches typically perform in a fixed-configuration manner. For example, each query is dispatched to a fixed number of partitions to run ANNS algorithms locally, and the results will be merged to obtain the final result set. Our observation shows that such fixed configurations typically lead to a non-optimal accuracy-efficiency trade-off. To further optimize multi-probe ANNS, we propose to generate efficient configurations for each query individually. By formalizing the per-query optimization as a 0-1 knapsack problem and its variants, we identify that the kNN distribution (the proportion of k nearest neighbors of a query placed in each partition) is essential to the optimization. Then we develop LEQAT (LEarned Query-Aware OpTimizer), which leverages kNN distribution to seek optimal configurations for each query. LEQAT comes with (i) a machine learning model to learn and estimate kNN distributions based on historical or sample queries and (ii) efficient query optimization algorithms to determine the partitions to probe and the number of searching neighbors in each partition. We apply LEQAT to three state-of-the-art ANNS methods IVF, HNSW, and SSG under clustering-based partitioning, evaluating the overall performance on several real-world datasets. The results show that LEQAT consistently reduces the latency by up to 58% and improves the throughput by up to 3.9 times.
引用
收藏
页码:623 / 645
页数:23
相关论文
共 58 条
[1]  
Adeniyi D., 2016, APPL COMPUT INFORM, V12, P90, DOI [10.1016/j.aci.2014.10.001, DOI 10.1016/J.ACI.2014.10.001]
[2]   Learning Set Cardinality in Distance Nearest Neighbours [J].
Anagnostopoulos, Christos ;
Triantafillou, Peter .
2015 IEEE INTERNATIONAL CONFERENCE ON DATA MINING (ICDM), 2015, :691-696
[3]  
Andoni A., 2015, ARXIV
[4]  
[Anonymous], 2007, VLDB
[5]  
[Anonymous], 2014, International Journal of Database Theory and Application., DOI [DOI 10.14257/IJDTA.2014.7.1.06, 10.14257/ijdta.2014.7.1.0]
[6]  
[Anonymous], 1995, Backpropagation Theory, Architectures, and Applications
[7]   Efficient Indexing of Billion-Scale datasets of deep descriptors [J].
Babenko, Artem ;
Lempitsky, Victor .
2016 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2016, :2055-2063
[8]   The Inverted Multi-Index [J].
Babenko, Artem ;
Lempitsky, Victor .
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2015, 37 (06) :1247-1260
[9]   MULTIDIMENSIONAL BINARY SEARCH TREES USED FOR ASSOCIATIVE SEARCHING [J].
BENTLEY, JL .
COMMUNICATIONS OF THE ACM, 1975, 18 (09) :509-517
[10]  
Chen L., 2005, P 2005 ACM SIGMOD IN, P491, DOI DOI 10.1145/1066157.1066213