Bootstrapping K-means for Big data analysis

Cited by: 0
Authors
Han, Jungkyu [1]
Luo, Min [1]
Affiliations
[1] Nippon Telegraph & Tel, Software Innovat Ctr, Tokyo, Japan
Source
2014 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA) | 2014
Keywords
Clustering; k-means; Big data; Bootstrap; Bootstrapping;
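DOI
Not available
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract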
In recent years, "big data" has become a popular term in industry. Distributed data processing middleware such as Hadoop enables companies to extract useful information from their big data. However, retrieving information from newly available big data is difficult even with the aid of distributed data processing, because the task requires many cycles of hypothesis formulation and testing owing to the lack of prior knowledge about the data. The k-means algorithm is a popular choice in the early stages of data mining because it is fast and unsupervised. However, with big data, even k-means is not fast enough to produce a desired result within an expected time period. In this paper, we propose a fast k-means method based on the statistical bootstrapping technique. Our proposed method achieves a roughly 100-fold speedup with similar accuracy compared to Lloyd's algorithm, the most popular k-means algorithm in industry.
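The abstract does not spell out how bootstrapping is combined with k-means, so the following is only a rough illustration of the general idea and should not be read as the authors' algorithm. It assumes one plausible scheme: run Lloyd's iterations on small samples drawn with replacement, warm-starting each round from the previous centroids. The function names `lloyd` and `bootstrap_kmeans` and the parameters `n_samples` and `sample_size` are hypothetical.

```python
import random

def lloyd(points, centroids, iters=20):
    """Plain Lloyd iterations: assign points to the nearest centroid, then recompute means."""
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            # Index of the centroid with the smallest squared Euclidean distance to p.
            j = min(range(len(centroids)),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        new = []
        for old, members in zip(centroids, clusters):
            if members:
                new.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new.append(old)  # keep the old centroid if its cluster is empty
        if new == centroids:  # converged: no centroid moved
            break
        centroids = new
    return centroids

def bootstrap_kmeans(points, k, n_samples=10, sample_size=200, seed=0):
    """Assumed scheme (not from the paper): refine centroids on successive bootstrap samples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids: k distinct data points
    for _ in range(n_samples):
        sample = rng.choices(points, k=sample_size)  # draw with replacement
        centroids = lloyd(sample, centroids)
    return centroids

if __name__ == "__main__":
    gen = random.Random(42)
    data = [(gen.gauss(0, 1), gen.gauss(0, 1)) for _ in range(500)]
    data += [(gen.gauss(8, 1), gen.gauss(8, 1)) for _ in range(500)]
    print(bootstrap_kmeans(data, k=2))
```

Each round touches only `sample_size` points rather than the full data set, which is where a speedup over full-data Lloyd iterations could come from; whether this matches the paper's reported figures is not something the sketch can show.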
Pages: 591-596
Page count: 6