Learning Balanced Tree Indexes for Large-Scale Vector Retrieval

Cited by: 2
Authors
Li, Wuchao [1]
Feng, Chao [1]
Lian, Defu [1]
Xie, Yuxin [2]
Liu, Haifeng [2]
Ge, Yong [3]
Chen, Enhong [1,4]
Affiliations
[1] Univ Sci & Technol China, Hefei, Anhui, Peoples R China
[2] Guangdong OPPO Mobile Telecommun Corp Ltd, Dongguan, Guangdong, Peoples R China
[3] Univ Arizona, Tucson, AZ USA
[4] State Key Lab Cognit Intelligence, Hefei, Anhui, Peoples R China
Source
PROCEEDINGS OF THE 29TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, KDD 2023 | 2023
Funding
National Natural Science Foundation of China
Keywords
Vector Retrieval; Nearest Neighbor Search (NNS); Maximum Inner Product Search (MIPS); Learn to index; Tree; Transformer; BINARY SEARCH TREES; PRODUCT QUANTIZATION; VORONOI DIAGRAMS; NEIGHBOR;
DOI
10.1145/3580305.3599406
CLC (Chinese Library Classification) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Vector retrieval focuses on finding the k-nearest neighbors of a query among a large collection of data points, and is widely used in areas such as information retrieval and recommender systems. The current state-of-the-art methods, represented by HNSW, usually build indexes with a large memory footprint, which restricts the scale of data they can handle unless they resort to a hybrid index backed by external storage. Space-partitioning learned indexes, which occupy only a small amount of memory, have made great progress in recent years. However, these methods rely on a large amount of labeled data for supervised learning, so model complexity affects their generalization. To this end, we propose a lightweight learnable hierarchical space-partitioning index based on a balanced K-ary tree, called the BAlanced Tree Learner (BATL), where each bucket of data points is represented by a path from the root to the corresponding leaf. Instead of mapping each query directly into a bucket, BATL classifies it into a sequence of branches (i.e., a path), which drastically reduces the number of classes and potentially improves generalization. BATL updates the classifier and the balanced tree in an alternating way. When updating the classifier, we leverage the sequence-to-sequence learning paradigm to learn to route each query to its ground-truth leaf in the balanced tree. Retrieval is thus reduced to a sequence (i.e., path) generation task, which can be performed simply by beam search over the encoder-decoder. When updating the balanced tree, we apply the classifier to navigate each data point into the tree nodes layer by layer under balance constraints. We finally evaluate BATL on several large-scale vector datasets, where the experimental results show the superiority of the proposed method over SOTA baselines in the trade-off among latency, accuracy, and memory cost.
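
The path-as-label idea and the beam-search decoding step described in the abstract can be made concrete with a small sketch. In a balanced K-ary tree of depth d there are K^d leaf buckets, but each routing step is only a K-way decision, so a path label is a sequence of d small decisions rather than one K^d-way classification. The Python sketch below is a minimal illustration of beam search over such root-to-leaf paths under assumed names; it is not the authors' implementation, and the toy scorer stands in for BATL's learned encoder-decoder, which would score each branch conditioned on the query and the partial path generated so far.

import heapq
import math
import random

def beam_search_path(score_branch, depth, K, beam_width):
    # Each beam entry is (partial_path, accumulated_log_score).
    beams = [((), 0.0)]
    for _ in range(depth):
        candidates = []
        for path, score in beams:
            for branch in range(K):
                # Extend every surviving partial path by each of the K branches.
                candidates.append((path + (branch,), score + score_branch(path, branch)))
        # Keep only the best `beam_width` partial paths for the next layer.
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[1])
    return beams  # top root-to-leaf paths, each identifying one bucket

# Toy stand-in (hypothetical) for the learned router: a fixed random log-score
# per (path, branch) pair instead of an encoder-decoder conditioned on the query.
random.seed(0)
_scores = {}
def toy_score_branch(path, branch):
    key = (path, branch)
    if key not in _scores:
        _scores[key] = math.log(random.random() + 1e-9)
    return _scores[key]

if __name__ == "__main__":
    # depth=3, K=4 gives 64 leaf buckets; beam_width=2 returns the two best paths.
    for path, score in beam_search_path(toy_score_branch, depth=3, K=4, beam_width=2):
        print(path, round(score, 3))

In an actual retrieval pipeline, the returned paths would identify candidate buckets whose data points are then re-ranked against the query; the sketch only shows how path generation replaces a flat bucket classification.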
Pages: 1353-1362
Number of pages: 10