Diamond Sampling for Approximate Maximum All-pairs Dot-product (MAD) Search

被引:30
作者
Ballard, Grey [1 ]
Kolda, Tamara G. [1 ]
Pinar, Ali [1 ]
Seshadhri, C. [2 ]
机构
[1] Sandia Natl Labs, Data Sci & Cyber Analyt Dept, Livermore, CA 94551 USA
[2] Univ Calif Santa Cruz, Dept Comp Sci, Santa Cruz, CA 95064 USA
来源
2015 IEEE INTERNATIONAL CONFERENCE ON DATA MINING (ICDM) | 2015年
关键词
MONTE-CARLO ALGORITHMS; GRAM MATRIX;
D O I
10.1109/ICDM.2015.46
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Given two sets of vectors, A = {(a(1)) over right arrow, . . . , (a(m)) over right arrow} and B = {(b(1)) over right arrow, . . . ,(b(n)) over right arrow}, our problem is to find the top-t dot products, i.e., the largest vertical bar(a(i)) over right arrow . (b(j)) over right arrow vertical bar among all possible pairs. This is a fundamental mathematical problem that appears in numerous data applications involving similarity search, link prediction, and collaborative filtering. We propose a sampling-based approach that avoids direct computation of all mn dot products. We select diamonds (i.e., four-cycles) from the weighted tripartite representation of A and B. The probability of selecting a diamond corresponding to pair (i, j) is proportional to ((a(i)) over right arrow . (b(j)) over right arrow)(2), amplifying the focus on the largest -magnitude entries. Experimental results indicate that diamond sampling is orders of magnitude faster than direct computation and requires far fewer samples than any competing approach. We also apply diamond sampling to the special case of maximum inner product search, and get significantly better results than the state-of-the-art hashing methods.
引用
收藏
页码:11 / 20
页数:10
相关论文
共 41 条
  • [1] Friends and neighbors on the Web
    Adamic, LA
    Adar, E
    [J]. SOCIAL NETWORKS, 2003, 25 (03) : 211 - 230
  • [2] Amossen Rasmus Resen, 2009, ICDT 2009, P121, DOI DOI 10.1145/1514894.1514909
  • [3] Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions
    Andoni, Alexandr
    Indyk, Piotr
    [J]. COMMUNICATIONS OF THE ACM, 2008, 51 (01) : 117 - 122
  • [4] An approximate algorithm for top-k closest pairs join query in large high dimensional data
    Angiulli, F
    Pizzuti, C
    [J]. DATA & KNOWLEDGE ENGINEERING, 2005, 53 (03) : 263 - 281
  • [5] [Anonymous], MATH KERN LIB REF MA
  • [6] [Anonymous], WWW 15
  • [7] [Anonymous], 2012, P ADMIRE 12 WORKSH L
  • [8] [Anonymous], 2011, ISMIR 2011
  • [9] [Anonymous], E2 LSH MANUAL
  • [10] [Anonymous], CUDA TOOLK DOC CUBLA