Randomized Algorithms Accelerated over CPU-GPU for Ultra-High Dimensional Similarity Search

被引:10
作者
Wang, Yiqiu [1 ]
Shrivastava, Anshumali [1 ]
Wang, Jonathan [1 ]
Ryu, Junghee [1 ]
机构
[1] Rice Univ, Houston, TX 77251 USA
来源
SIGMOD'18: PROCEEDINGS OF THE 2018 INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA | 2018年
关键词
Similarity search; locality sensitive hashing; reservoir sampling; GPGPU;
D O I
10.1145/3183713.3196925
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
We present FLASH (Fast LSH Algorithm for Similarity search accelerated with HPC), a similarity search system for ultra-high dimensional datasets on a single machine, that does not require similarity computations and is tailored for high-performance computing platforms. By leveraging a LSH style randomized indexing procedure and combining it with several principled techniques, such as reservoir sampling, recent advances in one-pass minwise hashing, and count based estimations, we reduce the computational and parallelization costs of similarity search, while retaining sound theoretical guarantees. We evaluate FLASH on several real, high-dimensional datasets from different domains, including text, malicious URL, click-through prediction, social networks, etc. Our experiments shed new light on the difficulties associated with datasets having several million dimensions. Current state-of-the-art implementations either fail on the presented scale or are orders of magnitude slower than FLASH. FLASH is capable of computing an approximate k-NN graph, from scratch, over the full webspam dataset (1.3 billion nonzeros) in less than 10 seconds. Computing a full k-NN graph in less than 10 seconds on the webspam dataset, using brute-force (n(2)D), will require at least 20 teraflops. We provide CPU and GPU implementations of FLASH for replicability of our results(1).
引用
收藏
页码:889 / 903
页数:15
相关论文
共 45 条
  • [1] THE FAST JOHNSON-LINDENSTRAUSS TRANSFORM AND APPROXIMATE NEAREST NEIGHBORS
    Ailon, Nir
    Chazelle, Bernard
    [J]. SIAM JOURNAL ON COMPUTING, 2009, 39 (01) : 302 - 322
  • [2] Andoni A, 2015, ADV NEUR IN, V28
  • [3] [Anonymous], INTRO WEBB SPAM CORP
  • [4] [Anonymous], 2014, Hashing for similarity search: A survey
  • [5] [Anonymous], 2007, P 33 INT C VER LARG
  • [6] [Anonymous], 2017, ARXIV170305160
  • [7] [Anonymous], E 2 LSH 0 1 USER MAN
  • [8] [Anonymous], 2012, Theory of computing, DOI DOI 10.4086/TOC.2012.V008A014
  • [9] [Anonymous], 2001, P 20 ACM SIGMOD SIGA
  • [10] [Anonymous], 2007, P 16 INT C WORLD WID, DOI DOI 10.1145/1242572.1242610