Sequential Hypothesis Tests for Adaptive Locality Sensitive Hashing

Cited by: 3
Authors
Chakrabarti, Aniket [1 ]
Parthasarathy, Srinivasan [1 ]
Affiliation
[1] Ohio State Univ, Columbus, OH 43210 USA
Source
PROCEEDINGS OF THE 24TH INTERNATIONAL CONFERENCE ON WORLD WIDE WEB (WWW 2015) | 2015
Funding
U.S. National Science Foundation;
Keywords
AllPairs Similarity Search; Locality Sensitive Hashing; Sequential Hypothesis Tests and Confidence Intervals;
DOI
10.1145/2736277.2741665
CLC Classification Number
TP [Automation Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
All pairs similarity search is the problem of finding, in a given set of data objects, all pairs of objects whose similarity exceeds a certain threshold for a given similarity measure of interest. When the number of points or the dimensionality is high, standard solutions fail to scale gracefully. Approximate solutions such as Locality Sensitive Hashing (LSH) and its Bayesian variants (BayesLSH and BayesLSHLite) alleviate the problem to some extent and provide substantial speedup over traditional index-based approaches. BayesLSH is used for pruning the candidate space and computing approximate similarity, whereas BayesLSHLite can only prune candidates; similarity must then be computed exactly on the original data. Thus, wherever the explicit data representation is available and exact similarity computation is not too expensive, BayesLSHLite can be used to aggressively prune candidates and provide substantial speedup without losing much quality. However, the loss in quality is higher for the BayesLSH variant, where an explicit data representation is not available; only a hash sketch is available, and similarity has to be estimated approximately. In this work we revisit the LSH problem from a frequentist setting and formulate sequential tests for the composite hypothesis (similarity greater than or less than a threshold) that such LSH algorithms can leverage to adaptively and aggressively prune candidates. We propose a vanilla sequential probability ratio test (SPRT) approach based on this idea, along with two novel variants. We extend these variants to the case where approximate similarity needs to be computed, using a fixed-width sequential confidence interval generation technique. We compare these novel variants with the SPRT variant and the BayesLSH/BayesLSHLite variants and show that they can provide tighter qualitative guarantees than BayesLSH/BayesLSHLite, a state-of-the-art approach, while being up to 2.1x faster than a traditional SPRT and 8.8x faster than AllPairs.
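The adaptive pruning idea described above can be illustrated with a small sketch. The snippet below is a minimal, hypothetical Python implementation of a vanilla Wald SPRT over a stream of LSH hash agreements, assuming a MinHash-style scheme in which the per-hash collision probability equals the pair's Jaccard similarity; the function name and the parameters t1, t2, alpha, and beta are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch (not the paper's exact algorithm): a vanilla Wald SPRT
# that decides, from a stream of LSH hash agreements, whether a candidate
# pair's similarity is likely below t1 (prune) or above t2 (keep). For a
# MinHash-style scheme, each agreement is a Bernoulli(s) trial where s is
# the pair's Jaccard similarity.

import math
from typing import Iterable, Optional

def sprt_prune(agreements: Iterable[int],
               t1: float, t2: float,
               alpha: float = 0.05, beta: float = 0.05) -> Optional[bool]:
    """Return True to keep the pair (similarity likely >= t2), False to
    prune it (similarity likely <= t1), or None if the hash budget ran out
    before a decision was reached."""
    # Wald's decision boundaries on the log-likelihood ratio.
    accept_h1 = math.log((1.0 - beta) / alpha)   # keep the candidate
    accept_h0 = math.log(beta / (1.0 - alpha))   # prune the candidate
    # Per-observation log-likelihood-ratio increments for match / mismatch.
    match_inc = math.log(t2 / t1)
    miss_inc = math.log((1.0 - t2) / (1.0 - t1))

    llr = 0.0
    for bit in agreements:          # 1 if the pair's hashes collide, else 0
        llr += match_inc if bit else miss_inc
        if llr >= accept_h1:
            return True
        if llr <= accept_h0:
            return False
    return None                     # undecided: fall back to exact similarity
```

A candidate pair is dropped as soon as the log-likelihood ratio crosses the lower boundary, so clearly dissimilar pairs typically consume only a handful of hash comparisons; this early stopping is the source of the speedup the abstract reports.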
Pages: 162-172
Page count: 11