A Fast Projection-Based Algorithm for Clustering Big Data

Cited by: 0
Authors
Yun Wu
Zhiquan He
Hao Lin
Yufei Zheng
Jingfen Zhang
Dong Xu
Affiliations
[1] Xiamen University of Technology, College of Computer and Information Engineering
[2] University of Missouri, Department of Computer Science and Christopher S. Bond Life Sciences Center
[3] Shenzhen University, College of Information Engineering
[4] University of Electronic Science and Technology of China, Key Laboratory for Neuro
Source
Interdisciplinary Sciences: Computational Life Sciences | 2019 / Vol. 11
Keywords
Big data analysis; Clustering; Projection; MUFOLD-CL
DOI
Not available
Abstract
With the rapid development of various techniques, more and more data have been accumulated that are both large in size (tall) and high in dimension (wide). The era of big data is coming. How to understand these data and discover new knowledge from them has attracted growing attention from scholars and has become one of the most important tasks in data mining. Clustering analysis, a kind of unsupervised learning and one of the most important techniques in data mining, groups a data set into objects (clusters) that are meaningful, useful, or both. The technique therefore plays a very important role in knowledge discovery in big data. However, when faced with large, high-dimensional data, most current clustering methods show poor computational efficiency and demand substantial computational resources, which hinders efforts to clarify the intrinsic properties of the data and to discover the knowledge behind them. Based on this consideration, we developed a clustering method called MUFOLD-CL. The principle of the method is to project the data points onto the centroid and then to measure the similarity between any two points by calculating their projections on the centroid. The proposed method achieves linear time complexity with respect to the sample size. A comparison with the K-Means method on very large data showed that our method produced better accuracy and required less computational time, demonstrating that MUFOLD-CL can serve as a valuable tool for big data clustering, or at least play a complementary role to other existing methods. Further comparisons with state-of-the-art clustering methods on smaller datasets showed that our method was the fastest and achieved comparable accuracy. For the convenience of other scholars, a free software package was constructed.
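The projection idea described in the abstract can be illustrated with a minimal sketch. This is not the MUFOLD-CL implementation: the abstract does not give the algorithm's details, so every function name below (`centroid`, `scalar_projections`, `split_by_largest_gaps`) is hypothetical, and the grouping step (cutting the sorted projections at the widest gaps) is an assumption added only to make the example complete. The centroid and projection steps each take one pass over the data; the sort in the assumed grouping step is O(n log n).

```python
import math

def centroid(points):
    """Mean of all points, computed in one pass: O(n * d)."""
    n = len(points)
    dim = len(points[0])
    return [sum(p[k] for p in points) / n for k in range(dim)]

def scalar_projections(points, direction):
    """Scalar projection of each point onto `direction`: (x . c) / ||c||."""
    norm = math.sqrt(sum(v * v for v in direction))
    if norm == 0.0:
        raise ValueError("centroid is the zero vector; projection is undefined")
    return [sum(a * b for a, b in zip(p, direction)) / norm for p in points]

def split_by_largest_gaps(values, k):
    """Hypothetical grouping step (an assumption, not from the paper):
    cut the sorted projections at the k-1 widest gaps."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    gaps = [(values[order[j + 1]] - values[order[j]], j)
            for j in range(len(order) - 1)]
    cut_after = {j for _, j in sorted(gaps, reverse=True)[:k - 1]}
    labels = [0] * len(values)
    label = 0
    for pos, idx in enumerate(order):
        labels[idx] = label
        if pos in cut_after:
            label += 1  # start a new cluster after each chosen gap
    return labels

# Toy data: two well-separated groups of 2-D points.
points = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
          [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]]
c = centroid(points)
proj = scalar_projections(points, c)
labels = split_by_largest_gaps(proj, k=2)
print(labels)
```

On this toy data the two groups receive labels `[0, 0, 0, 1, 1, 1]`. Under this sketch, the similarity between two points is simply the absolute difference of their projections, so points that differ only in directions orthogonal to the centroid are treated as similar; how the published method handles that case is not stated in the abstract.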
Pages: 360-366
Number of pages: 6