A Parallel Multiobjective PSO Weighted Average Clustering Algorithm Based on Apache Spark

Cited by: 2
Authors
Ling, Huidong [1]
Zhu, Xinmu [1]
Zhu, Tao [1]
Nie, Mingxing [1]
Liu, Zhenghai [1]
Liu, Zhenyu [1]
Affiliation
[1] University of South China, School of Computer Science, Hengyang 421200, People's Republic of China
Funding
National Natural Science Foundation of China
Keywords
multiobjective clustering; Apache Spark; multiobjective particle swarm optimization (MOPSO)
DOI
10.3390/e25020259
Chinese Library Classification number
O4 [Physics]
Discipline classification code
0702
Abstract
Multiobjective clustering algorithms based on particle swarm optimization have been applied successfully in a number of applications. However, existing algorithms are implemented on a single machine and cannot be directly parallelized on a cluster, which makes it difficult for them to handle large-scale data. With the development of distributed parallel computing frameworks, data parallelism was proposed. However, increasing the degree of parallelism leads to unbalanced data distribution, which degrades clustering quality. In this paper, we propose a parallel multiobjective PSO weighted average clustering algorithm based on Apache Spark (Spark-MOPSO-Avg). First, the entire data set is divided into multiple partitions and cached in memory using the distributed, in-memory computing of Apache Spark. The local fitness value of each particle is calculated in parallel from the data in each partition. After the calculation is completed, only particle information is transmitted, so large numbers of data objects need not be moved between nodes; this reduces network communication and thus the algorithm's running time. Second, a weighted average of the local fitness values is computed to mitigate the effect of unbalanced data distribution on the results. Experimental results show that Spark-MOPSO-Avg keeps information loss low under data parallelism, losing about 1% to 9% accuracy, while effectively reducing the algorithm's time overhead. It shows good execution efficiency and parallel computing capability on a Spark distributed cluster.
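The weighted-average step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: plain Python lists stand in for Spark partitions, the function names `local_fitness` and `weighted_average_fitness` are hypothetical, a single objective (mean distance to the nearest centroid) is assumed for illustration, and weighting each local value by its partition size is an assumption, since the abstract does not state the weights.

```python
from math import dist

def local_fitness(partition, centroids):
    """Return (mean distance to nearest centroid, point count) for one partition.

    Assumed objective for illustration only; the paper's objectives may differ.
    """
    total = sum(min(dist(p, c) for c in centroids) for p in partition)
    return total / len(partition), len(partition)

def weighted_average_fitness(partitions, centroids):
    """Combine local fitness values, weighting each by its partition size (assumption)."""
    results = [local_fitness(part, centroids) for part in partitions if part]
    n_total = sum(n for _, n in results)
    return sum(f * n for f, n in results) / n_total

# Two deliberately unbalanced partitions: 3 points vs. 1 point.
partitions = [
    [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0)],
    [(10.0, 4.0)],
]
# One particle's candidate solution: a set of cluster centroids.
centroids = [(0.0, 1.0), (10.0, 1.0)]

global_fitness = weighted_average_fitness(partitions, centroids)

# A plain (unweighted) mean of the local values, for comparison.
unweighted = sum(local_fitness(p, centroids)[0] for p in partitions) / len(partitions)

print(global_fitness, unweighted)
```

In this toy case the size-weighted value matches the fitness computed over all four points at once, whereas the unweighted mean overstates the influence of the single-point partition. Only the per-partition pair (fitness, count) has to be exchanged, which mirrors the abstract's point that particle information, not data objects, moves between nodes.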
Pages: 14