p-PIC: Parallel power iteration clustering for big data

Cited by: 26
Authors
Yan, Weizhong [1]
Brahmakshatriya, Umang [1]
Xue, Ya [1]
Gilder, Mark [2]
Wise, Bowden [3]
Affiliations
[1] GE Global Res Ctr, Machine Learning Lab, Niskayuna, NY 12039 USA
[2] GE Global Res Ctr, Comp & Cyber Secur Lab, Niskayuna, NY 12039 USA
[3] GE Global Res Ctr, Knowledge Discovery Lab, Niskayuna, NY 12039 USA
Keywords
Big data; Clustering; Cloud computing; Data mining; Distributed computing; Machine learning; Parallel computing; Spectral clustering; Algorithm
DOI
10.1016/j.jpdc.2012.06.009
Chinese Library Classification
TP301 [Theory, Methods]
Subject classification code
081202
Abstract
Power iteration clustering (PIC) is a newly developed clustering algorithm. It performs clustering by embedding data points in a low-dimensional subspace derived from the similarity matrix. Compared to traditional clustering algorithms, PIC is simple, fast and relatively scalable. However, it requires that the data and its associated similarity matrix fit into memory, which makes the algorithm infeasible for big data applications. This paper attempts to expand PIC's data scalability by implementing a parallel version of the algorithm, parallel power iteration clustering (p-PIC). While this paper focuses on exploring different parallelization strategies and implementation details for minimizing computation and communication costs, we have also paid great attention to ensuring that the algorithm works well on low-end commodity computers (COTS-based clusters and general-purpose servers found at most commercial cloud providers). The experimental results demonstrate that the proposed p-PIC algorithm is highly scalable with respect to both data size and compute resources. © 2012 Elsevier Inc. All rights reserved.
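The abstract does not spell out the algorithm, so the following is only a minimal illustrative sketch of the general PIC idea as commonly described: row-normalize the similarity matrix, run a truncated power iteration, and cluster the resulting one-dimensional embedding. It is not the authors' p-PIC implementation. The function names, the toy similarity matrix, the stopping tolerance, and the largest-gap split (a simple stand-in for k-means when there are two clusters) are all assumptions made for illustration. The row blocks only mimic, in a single process, how the matrix-vector product could be divided among workers.

```python
import numpy as np


def demo_affinity(sizes=(3, 5), within=1.0, between=0.01):
    """Toy block-structured similarity matrix (hypothetical test data)."""
    n = sum(sizes)
    a = np.full((n, n), between)
    start = 0
    for s in sizes:
        a[start:start + s, start:start + s] = within
        start += s
    np.fill_diagonal(a, 0.0)  # no self-similarity
    return a


def pic_embedding(affinity, n_blocks=2, max_iter=100, tol=1e-5):
    """Truncated power iteration on the row-normalized similarity matrix.

    The rows are split into n_blocks blocks; each block's product could be
    computed by a separate worker (here they simply run one after another).
    """
    degree = affinity.sum(axis=1)
    w = affinity / degree[:, None]      # W = D^-1 A
    v = degree / degree.sum()           # degree-based starting vector
    row_blocks = np.array_split(np.arange(len(v)), n_blocks)
    prev_delta = np.inf
    for _ in range(max_iter):
        # Each block needs only its own rows of W plus the full vector v.
        parts = [w[rows] @ v for rows in row_blocks]
        v_new = np.concatenate(parts)
        v_new /= np.abs(v_new).sum()    # L1 normalization
        delta = np.abs(v_new - v).max()
        v = v_new
        # Stop early, once the rate of change has levelled off, so the
        # vector still carries cluster structure instead of going uniform.
        if abs(prev_delta - delta) < tol:
            break
        prev_delta = delta
    return v


def split_at_largest_gap(v):
    """Two-cluster split of a 1-D embedding (stand-in for k-means)."""
    order = np.argsort(v)
    cut = np.argmax(np.diff(v[order]))
    labels = np.zeros(len(v), dtype=int)
    labels[order[cut + 1:]] = 1
    return labels


labels = split_at_largest_gap(pic_embedding(demo_affinity()))
print(labels)
```

In this sketch each block needs its rows of the normalized matrix and the current vector, so a distributed version would have to exchange only the vector between iterations rather than the matrix. Whether p-PIC partitions the work this way is not stated in the abstract.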
Pages: 352-359
Page count: 8
Related papers
50 in total
  • [1] A survey on parallel clustering algorithms for Big Data
    Dafir, Zineb
    Lamari, Yasmine
    Slaoui, Said Chah
    ARTIFICIAL INTELLIGENCE REVIEW, 2021, 54 (04) : 2411 - 2443
  • [2] A survey on parallel clustering algorithms for Big Data
    Zineb Dafir
    Yasmine Lamari
    Said Chah Slaoui
    Artificial Intelligence Review, 2021, 54 : 2411 - 2443
  • [3] A GPU Based Parallel Clustering Method for Electric Power Big Data
    Ji, Cong
    Xiong, Zheng
    Fang, Chao
    Lv, Hui
    Zhang, Kaizhen
    2017 4TH INTERNATIONAL CONFERENCE ON INFORMATION SCIENCE AND CONTROL ENGINEERING (ICISCE), 2017, : 29 - 33
  • [4] Parallel and distributed clustering framework for big spatial data mining
    Bendechache, Malika
    Tari, A-Kamel
    Kechadi, M-Tahar
    INTERNATIONAL JOURNAL OF PARALLEL EMERGENT AND DISTRIBUTED SYSTEMS, 2019, 34 (06) : 671 - 689
  • [5] Using Parallel Hierarchical Clustering to Address Spatial Big Data Challenges
    Woodley, Alan
    Tang, Ling-Xiang
    Geva, Shlomo
    Nayak, Richi
    Chappell, Timothy
    2016 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2016, : 2692 - 2698
  • [6] An Efficient Parallel Algorithm for Clustering Big Data based on the Spark Framework
    Dafir, Zineb
    Slaoui, Said
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2022, 13 (07) : 890 - 896
  • [7] Parallel K-prototypes for Clustering Big Data
    Ben HajKacem, Mohamed Aymen
    Ben N'cir, Chiheb-Eddine
    Essoussi, Nadia
    COMPUTATIONAL COLLECTIVE INTELLIGENCE (ICCCI 2015), PT II, 2015, 9330 : 628 - 637
  • [8] Parallel batch k-means for Big data clustering
    Alguliyev, Rasim M.
    Aliguliyev, Ramiz M.
    Sukhostat, Lyudmila, V
    COMPUTERS & INDUSTRIAL ENGINEERING, 2021, 152
  • [9] Superior Parallel Big Data Clustering Through Competitive Stochastic Sample Size Optimization in Big-Means
    Mussabayev, Rustam
    Mussabayev, Ravil
    INTELLIGENT INFORMATION AND DATABASE SYSTEMS, PT II, ACIIDS 2024, 2024, 14796 : 224 - 236
  • [10] Big Data Clustering: A Review
    Shirkhorshidi, Ali Seyed
    Aghabozorgi, Saeed
    Teh, Ying Wah
    Herawan, Tutut
    COMPUTATIONAL SCIENCE AND ITS APPLICATIONS - ICCSA 2014, PT V, 2014, 8583 : 707 - 720