TURNING BIG DATA INTO TINY DATA: CONSTANT-SIZE CORESETS FOR k-MEANS, PCA, AND PROJECTIVE CLUSTERING

Cited: 102
Authors
Feldman, Dan [1 ]
Schmidt, Melanie [2 ,3 ]
Sohler, Christian [2 ,3 ,4 ]
Affiliations
[1] Univ Haifa, Robot & Big Data Lab, IL-3498838 Haifa, Israel
[2] Univ Cologne, D-50923 Cologne, Germany
[3] TU Dortmund, Dortmund, Germany
[4] Univ Bonn, Bonn, Germany
Keywords
PCA; k-means; projective clustering; coresets; streaming; big data; local search; approximation; algorithms; matrices; bounds;
DOI
10.1137/18M1209854
Chinese Library Classification
TP301 [Theory and Methods];
Subject Classification Code
081202;
Abstract
We develop and analyze a method to reduce the size of a very large set of data points in a high-dimensional Euclidean space R^d to a small set of weighted points such that the result of a predetermined data analysis task on the reduced set is approximately the same as that for the original point set. For example, computing the first k principal components of the reduced set will return approximately the first k principal components of the original set, and computing the centers of a k-means clustering on the reduced set will return an approximation for the original set. Such a reduced set is also known as a coreset. The main new feature of our construction is that the cardinality of the reduced set is independent of the dimension d of the input space and that the sets are mergeable [P. K. Agarwal et al., Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, 2012, pp. 23-34]. The latter property means that the union of two reduced sets is a reduced set for the union of the two original sets. It allows us to turn our methods into streaming or distributed algorithms using standard approaches. For problems such as k-means and subspace approximation the coreset sizes are also independent of the number of input points. Our method is based on data-dependently projecting the points onto a low-dimensional subspace and reducing the cardinality of the points inside this subspace using known methods. The proposed approach works for a wide range of data analysis techniques including k-means clustering, principal component analysis, and subspace clustering. The main conceptual contribution is a new coreset definition that allows charging costs that appear for every solution to an additive constant.
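The two-step idea in the abstract (project onto a data-dependent low-dimensional subspace, then reduce cardinality inside it, charging the orthogonal residual to an additive constant) can be sketched as follows. This is an illustrative toy version, not the paper's actual construction: the function name `sketch_coreset` and its parameters are hypothetical, and the cardinality-reduction step here is plain uniform sampling with reweighting, whereas the paper relies on stronger, sensitivity-based sampling and careful subspace dimensions.

```python
import numpy as np

def sketch_coreset(P, m, s, seed=None):
    """Toy sketch of the project-then-sample coreset idea.

    P    : (n, d) array of input points
    m    : dimension of the data-dependent subspace (top-m PCA directions)
    s    : number of weighted points to keep
    Returns (sample points, weights, additive constant).
    """
    rng = np.random.default_rng(seed)
    mean = P.mean(axis=0)
    # Data-dependent subspace: top-m right singular vectors of the centered data.
    _, _, Vt = np.linalg.svd(P - mean, full_matrices=False)
    V = Vt[:m].T                              # d x m orthonormal basis
    proj = (P - mean) @ V @ V.T + mean        # points projected into the subspace
    # Residual cost incurred by every candidate solution alike; kept as an
    # additive constant, in the spirit of the paper's coreset definition.
    additive_const = float(np.sum((P - proj) ** 2))
    # Cardinality reduction inside the subspace (simplified to uniform
    # sampling with reweighting; the paper uses sensitivity sampling).
    idx = rng.choice(len(P), size=s, replace=False)
    weights = np.full(s, len(P) / s)
    return proj[idx], weights, additive_const
```

A downstream k-means or PCA routine would then be run on the weighted sample, adding `additive_const` back to any reported cost.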
Pages: 601-657 (57 pages)
Cited References
78 in total (first 10 shown)
[1] Ackermann, M. R. ACM Journal of Experimental Algorithmics, 2012, 17.
[2] Agarwal, P. K., et al. Proceedings of the ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 2012, pp. 23-34. DOI 10.1145/2213556.2213562.
[3] Agarwal, P. K.; Har-Peled, S.; Varadarajan, K. R. Approximating extent measures of points. Journal of the ACM, 2004, 51(4): 606-635.
[4] Aggarwal, B. B. Drug Discovery Series, 2009, 12: 15.
[5] Ahmadian, S.; Norouzi-Fard, A.; Svensson, O.; Ward, J. Better guarantees for k-means and Euclidean k-median by primal-dual algorithms. 58th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2017, pp. 61-72.
[6] Aloise, D.; Deshpande, A.; Hansen, P.; Popat, P. NP-hardness of Euclidean sum-of-squares clustering. Machine Learning, 2009, 75(2): 245-248.
[7] [Anonymous]. Proceedings of the 23rd ACM Symposium on Computational Geometry (SoCG), 2007.
[8] Arthur, D. Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2007, p. 1027.
[9] Awasthi, P. 31st International Symposium on Computational Geometry (SoCG), 2015, Vol. 34, p. 754.
[10] Batson, J.; Spielman, D. A.; Srivastava, N. Twice-Ramanujan sparsifiers. SIAM Journal on Computing, 2012, 41(6): 1704-1721.