A Distributed Density-Grid Clustering Algorithm for Multi-Dimensional Data

Cited by: 0

Authors:

Brown, Daniel [1]

Shi, Yong [1]

Affiliations:

[1] Kennesaw State Univ, Coll Comp & Software Engn, Marietta, GA 30060 USA

Source:

2020 10TH ANNUAL COMPUTING AND COMMUNICATION WORKSHOP AND CONFERENCE (CCWC) | 2020

Keywords:

Clustering; Density-Based Clustering; Grid-Based Clustering; Parallel Computing; Distributed Computing; Apache Spark

DOI:

10.1109/ccwc47524.2020.9031132

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods]

Discipline classification code:

081202

Abstract:

In recent years, major leaps in technology have brought large advances in how we collect and use data, and with them a rise in the prominence of Big Data. Organizations and businesses rely heavily on data analysis in almost every field of work. This need, combined with larger and more complex datasets, poses many challenges for these groups as they seek to keep up. Clustering, a form of unsupervised machine learning, is heavily used in many different industries. Traditional clustering algorithms typically suffer in performance and accuracy as datasets increase in size and dimensionality. We previously proposed a new clustering algorithm, the Fast Density-Grid clustering algorithm, that alleviated some of the runtime problems. In modern data analysis, however, serial algorithms are still too slow to be of much use. The Fast Density-Grid algorithm was originally designed with parallelization in mind, and this paper discusses the steps taken to implement a parallel version. Our experimental results show that, when the number of records in the dataset exceeds a certain amount, the parallel form of the algorithm outperforms the serial form. Studying this critical point allows us to determine whether the algorithm is suitable for real-world use.
