p-PIC: Parallel power iteration clustering for big data

Cited by: 26
Authors
Yan, Weizhong [1]
Brahmakshatriya, Umang [1]
Xue, Ya [1]
Gilder, Mark [2]
Wise, Bowden [3]
Affiliations
[1] GE Global Res Ctr, Machine Learning Lab, Niskayuna, NY 12039 USA
[2] GE Global Res Ctr, Comp & Cyber Secur Lab, Niskayuna, NY 12039 USA
[3] GE Global Res Ctr, Knowledge Discovery Lab, Niskayuna, NY 12039 USA
Keywords
Big data; Clustering; Cloud computing; Data-mining; Distributed computing; Machine learning; Parallel computing; Spectral clustering; ALGORITHM;
DOI
10.1016/j.jpdc.2012.06.009
Chinese Library Classification number
TP301 [Theory, methods];
Discipline classification code
081202;
Abstract
Power iteration clustering (PIC) is a newly developed clustering algorithm. It performs clustering by embedding data points in a low-dimensional subspace derived from the similarity matrix. Compared to traditional clustering algorithms, PIC is simple, fast and relatively scalable. However, it requires that the data and the associated similarity matrix fit into memory, which makes the algorithm infeasible for big data applications. This paper attempts to expand PIC's data scalability by implementing a parallel power iteration clustering algorithm (p-PIC). While the paper focuses on exploring different parallelization strategies and implementation details for minimizing computation and communication costs, we have also paid close attention to ensuring that the algorithm works well on low-end commodity computers (COTS-based clusters and the general-purpose servers found at most commercial cloud providers). The experimental results demonstrate that the proposed p-PIC algorithm scales well with both data size and compute resources. © 2012 Elsevier Inc. All rights reserved.
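To make the abstract's description of PIC more concrete, the following is a minimal serial sketch of the idea as it is commonly formulated: row-normalize the similarity matrix, run a truncated power iteration to obtain a one-dimensional embedding, then cluster that embedding with k-means. It is not the p-PIC implementation from the paper and does not show any parallelization. The function names, the Gaussian similarity, the `sigma` value, and the stopping threshold `eps` are all illustrative assumptions that do not come from this record.

```python
import math

def gaussian_affinity(points, sigma=1.0):
    """Dense similarity matrix with zero diagonal (assumed Gaussian kernel)."""
    n = len(points)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            A[i][j] = A[j][i] = math.exp(-d2 / (2.0 * sigma ** 2))
    return A

def pic_embedding(A, max_iter=50, eps=1e-5):
    """Truncated power iteration on W = D^-1 A; returns a 1-D embedding.

    Assumes every row of A has a positive sum (no isolated points).
    """
    degrees = [sum(row) for row in A]
    W = [[a / d for a in row] for row, d in zip(A, degrees)]
    total = sum(degrees)
    v = [d / total for d in degrees]          # degree-based starting vector
    prev_delta = float("inf")
    for _ in range(max_iter):
        u = [sum(w * x for w, x in zip(row, v)) for row in W]
        norm = sum(abs(x) for x in u)         # L1 normalization
        u = [x / norm for x in u]
        delta = max(abs(a - b) for a, b in zip(u, v))
        v = u
        # Stop early, once the change between iterations levels off,
        # so that the vector still carries cluster structure.
        if abs(prev_delta - delta) < eps:
            break
        prev_delta = delta
    return v

def kmeans_1d(values, k, n_iter=100):
    """Plain Lloyd's k-means on scalars, centers seeded evenly over the range."""
    lo, hi = min(values), max(values)
    if k == 1:
        return [0] * len(values)
    centers = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = [0] * len(values)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: abs(x - centers[c])) for x in values]
        new_centers = []
        for c in range(k):
            members = [x for x, l in zip(values, labels) if l == c]
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels

if __name__ == "__main__":
    # Two well-separated toy groups of different size (hypothetical data).
    pts = [(0, 0), (0.1, 0), (0, 0.1),
           (5, 5), (5.1, 5), (5, 5.1), (5.1, 5.1)]
    emb = pic_embedding(gaussian_affinity(pts))
    print(kmeans_1d(emb, k=2))
```

The sketch builds the full n-by-n similarity matrix in memory, which is exactly the limitation the abstract identifies for large data sets.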
Pages: 352-359
Page count: 8
Related papers
50 items in total
  • [31] Heterogeneous Distributed Big Data Clustering on Sparse Grids
    Pfander, David
    Daiss, Gregor
    Pflueger, Dirk
    ALGORITHMS, 2019, 12 (03)
  • [32] Sinh-Cosh Optimization-Based Efficient Clustering for Big Data Applications
    Khrissi, Lahbib
    Es-Sabry, Mohammed
    El Akkad, Nabil
    Satori, Hassan
    Aldosary, Saad
    El-Shafai, Walid
    IEEE ACCESS, 2024, 12 : 193676 - 193692
  • [33] Parallel Implementation of P Systems for Data Clustering on GPU
    Jin, Jie
    Liu, Hui
    Wang, Fengjuan
    Peng, Hong
    Wang, Jun
    BIO-INSPIRED COMPUTING - THEORIES AND APPLICATIONS, BIC-TA 2015, 2015, 562 : 200 - 211
  • [34] High Performance Big Data Clustering
    Agrawal, Ankit
    Patwary, Md. Mostofa Ali
    Hendrix, William
    Liao, Wei-keng
    Choudhary, Alok
    CLOUD COMPUTING AND BIG DATA, 2013, 23 : 192 - 211
  • [35] A Novel Intelligent Clustering Approach for High Dimensional Data in a Big Data Environment
    Tao, Qian
    Wang, Zhenyu
    Gu, Chunqin
    Chen, Wenyuan
    Lin, Weiqiang
    Lin, Haojie
    2017 13TH INTERNATIONAL CONFERENCE ON NATURAL COMPUTATION, FUZZY SYSTEMS AND KNOWLEDGE DISCOVERY (ICNC-FSKD), 2017,
  • [36] Distributed Bayesian Matrix Decomposition for Big Data Mining and Clustering
    Zhang, Chihao
    Yang, Yang
    Zhou, Wei
    Zhang, Shihua
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2022, 34 (08) : 3701 - 3713
  • [37] Iterative subsampling in solution path clustering of noisy big data
    Marchetti, Yuliya
    Zhou, Qing
    STATISTICS AND ITS INTERFACE, 2016, 9 (04) : 415 - 431
  • [38] Big Data based User Clustering and Influence Power Ranking
    Jia, Yuwei
    Chao, Kun
    Cheng, Xinzhou
    Yuan, Mingqiang
    Mu, Mingjun
    2016 16TH INTERNATIONAL SYMPOSIUM ON COMMUNICATIONS AND INFORMATION TECHNOLOGIES (ISCIT), 2016, : 371 - 375
  • [39] How to Use K-means for Big Data Clustering?
    Mussabayev, Rustam
    Mladenovic, Nenad
    Jarboui, Bassem
    Mussabayev, Ravil
    PATTERN RECOGNITION, 2023, 137
  • [40] The Survey on Approaches to Efficient Clustering and Classification Analysis of Big Data
    Gandhi, Bhagyashri S.
    Deshpande, Leena A.
    2016 INTERNATIONAL CONFERENCE ON COMPUTING COMMUNICATION CONTROL AND AUTOMATION (ICCUBEA), 2016,