Sub-linear RACE Sketches for Approximate Kernel Density Estimation on Streaming Data

被引:19
作者
Coleman, Benjamin [1 ]
Shrivastava, Anshumali [2 ]
机构
[1] Rice Univ, Dept Elect & Comp Engn, Houston, TX 77005 USA
[2] Rice Univ, Dept Comp Sci, Houston, TX USA
来源
WEB CONFERENCE 2020: PROCEEDINGS OF THE WORLD WIDE WEB CONFERENCE (WWW 2020) | 2020年
关键词
kernel density estimation; sketching; streaming data; locality sensitive hashing; compression; SEQUENCES;
D O I
10.1145/3366423.3380244
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
Kernel density estimation is a simple and effective method that lies at the heart of many important machine learning applications. Unfortunately, kernel methods scale poorly for large, high dimensional datasets. Approximate kernel density estimation has a prohibitively high memory and computation cost, especially in the streaming setting. Recent sampling algorithms for high dimensional densities can reduce the computation cost but cannot operate online, while streaming algorithms cannot handle high dimensional datasets due to the curse of dimensionality. We propose RACE, an efficient sketching algorithm for kernel density estimation on high-dimensional streaming data. RACE compresses a set of N high dimensional vectors into tiny arrays of integer counters. These arrays are sufficient to estimate the kernel density for a large class of kernels. Our one-pass sketch is simple to implement and comes with strong theoretical guarantees. We evaluate our method on real-world high-dimensional datasets and show that our sketch achieves 10x better compression compared to existing methods.(1)
引用
收藏
页码:1739 / 1749
页数:11
相关论文
共 50 条
[1]   Mergeable Summaries [J].
Agarwal, Pankaj K. ;
Cormode, Graham ;
Huang, Zengfeng ;
Phillips, Jeff M. ;
Wei, Zhewei ;
Yi, Ke .
ACM TRANSACTIONS ON DATABASE SYSTEMS, 2013, 38 (04)
[2]  
Ailon N., 2006, STOC'06. Proceedings of the 38th Annual ACM Symposium on Theory of Computing, P557, DOI 10.1145/1132516.1132597
[3]  
[Anonymous], **DATA OBJECT**, DOI DOI 10.4231/R7RX991C
[4]  
[Anonymous], 1998, LECT NOTES COMPUTER
[5]  
[Anonymous], 2014, P 31 INT C MACH LEAR
[6]  
Apple Differential Privacy Team, 2017, Apple Machine Learning Journal, V1, P8
[7]   SOMKE: Kernel Density Estimation Over Data Streams by Sequences of Self-Organizing Maps [J].
Cao, Yuan ;
He, Haibo ;
Man, Hong .
IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2012, 23 (08) :1254-1268
[8]   Mode-finding for mixtures of Gaussian distributions [J].
Carreira-Perpiñán, MA .
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2000, 22 (11) :1318-1323
[9]   Hashing-Based-Estimators for Kernel Density in High Dimensions [J].
Charikar, Moses ;
Siminelakis, Paris .
2017 IEEE 58TH ANNUAL SYMPOSIUM ON FOUNDATIONS OF COMPUTER SCIENCE (FOCS), 2017, :1032-1043
[10]  
Chen Y., 2012, ARXIV12033472