Fuzzy Based Scalable Clustering Algorithms for Handling Big Data Using Apache Spark

Cited by: 42
Authors
Bharill, Neha [1 ]
Tiwari, Aruna [1 ]
Malviya, Aayushi [1 ]
Affiliation
[1] Department of Computer Science and Engineering, Indian Institute of Technology, Indore 453552, India
Keywords
Clustering algorithms; Information use; Learning systems; Copying; Information analysis; Iterative methods; Big data
DOI
10.1109/TBDATA.2016.2622288
Abstract
A huge amount of digital data containing useful information, called Big Data, is generated every day. To mine this information, clustering is a widely used data analysis technique. A large number of Big Data analytics frameworks have been developed to scale clustering algorithms for big data analysis. One such framework, Apache Spark, is well suited to iterative algorithms because it supports in-memory computation and scales across a cluster. We focus on the design and implementation of partitional clustering algorithms on Apache Spark, which are suited to clustering large datasets because of their low computational requirements. In this paper, we propose the Scalable Random Sampling with Iterative Optimization Fuzzy c-Means algorithm (SRSIO-FCM), implemented on an Apache Spark cluster, to handle the challenges associated with big data clustering. Experimental studies on various big datasets have been conducted. The performance of SRSIO-FCM is compared with proposed scalable versions of the Literal Fuzzy c-Means (LFCM) and Random Sampling plus Extension Fuzzy c-Means (rseFCM) algorithms, also implemented on the Apache Spark cluster. The comparative results are reported in terms of time and space complexity, run time, and a measure of clustering quality, and show that SRSIO-FCM runs in much less time without compromising clustering quality. © 2015 IEEE.
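As an illustration of the kind of distributed fuzzy clustering the abstract refers to, the sketch below shows a single Fuzzy c-Means update expressed with PySpark RDD operations: current centers are broadcast, each point emits its fuzzified weighted contribution per cluster, and contributions are reduced to form new centers. This is a minimal sketch only, not the authors' SRSIO-FCM; the fuzzifier value, cluster count, and synthetic data are assumptions for demonstration.

```python
# Minimal sketch of one Fuzzy c-Means iteration on a Spark RDD.
# NOT the SRSIO-FCM algorithm from the paper; parameters are illustrative.
import numpy as np
from pyspark import SparkContext

def fcm_iteration(points_rdd, centers, m=2.0):
    """Return updated cluster centers computed from fuzzy memberships."""
    bc = points_rdd.context.broadcast(centers)

    def contribution(x):
        v = bc.value
        d = np.linalg.norm(x - v, axis=1) + 1e-10        # distances to each center
        u = 1.0 / (d ** (2.0 / (m - 1.0)))               # unnormalized memberships
        u = u / u.sum()                                  # memberships sum to 1
        w = u ** m                                       # fuzzified weights
        # emit (cluster index, (weighted point, weight)) pairs
        return [(i, (w[i] * x, w[i])) for i in range(len(v))]

    sums = (points_rdd.flatMap(contribution)
            .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
            .collectAsMap())
    return np.array([sums[i][0] / sums[i][1] for i in range(len(centers))])

if __name__ == "__main__":
    sc = SparkContext(appName="fcm-sketch")
    # Two synthetic Gaussian blobs as toy data (assumption for the demo).
    data = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 5])
    rdd = sc.parallelize([row for row in data])
    centers = data[np.random.choice(len(data), 2, replace=False)]
    for _ in range(10):                                  # fixed iteration count for the sketch
        centers = fcm_iteration(rdd, centers)
    print(centers)
    sc.stop()
```

SRSIO-FCM, as described in the abstract, additionally partitions the data into random subsets and optimizes iteratively across them; the loop above simply repeats full-data updates and is meant only to show how the membership and center computations map onto Spark's broadcast/flatMap/reduceByKey primitives.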
Pages: 339-352