p-PIC: Parallel power iteration clustering for big data

Cited by: 26
Authors
Yan, Weizhong [1]
Brahmakshatriya, Umang [1]
Xue, Ya [1]
Gilder, Mark [2]
Wise, Bowden [3]
Affiliations
[1] GE Global Res Ctr, Machine Learning Lab, Niskayuna, NY 12039 USA
[2] GE Global Res Ctr, Comp & Cyber Secur Lab, Niskayuna, NY 12039 USA
[3] GE Global Res Ctr, Knowledge Discovery Lab, Niskayuna, NY 12039 USA
Keywords
Big data; Clustering; Cloud computing; Data-mining; Distributed computing; Machine learning; Parallel computing; Spectral clustering; ALGORITHM;
DOI
10.1016/j.jpdc.2012.06.009
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Power iteration clustering (PIC) is a newly developed clustering algorithm. It performs clustering by embedding data points in a low-dimensional subspace derived from the similarity matrix. Compared to traditional clustering algorithms, PIC is simple, fast and relatively scalable. However, it requires that the data and its associated similarity matrix fit into memory, which makes the algorithm infeasible for big data applications. This paper attempts to expand PIC's data scalability by implementing a parallel power iteration clustering (p-PIC). While this paper focuses on exploring different parallelization strategies and implementation details for minimizing computation and communication costs, we have also paid great attention to ensuring the algorithm works well on low-end commodity computers (COTS-based clusters and general purpose servers found at most commercial cloud providers). The experimental results demonstrate that the proposed p-PIC algorithm is highly scalable to both data and compute resources. © 2012 Elsevier Inc. All rights reserved.
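To make the embedding idea in the abstract concrete, the following is a minimal single-machine sketch of power iteration clustering in Python. It is not the paper's p-PIC implementation. The Gaussian similarity, the `sigma` parameter, the degree-based start vector, the stopping rule, and the gap-based split of the one-dimensional embedding (used here in place of k-means to keep the example dependency-free) are all assumptions made for illustration, and every function name is hypothetical.

```python
import math


def affinity_matrix(points, sigma=1.0):
    """Gaussian similarity with a zero diagonal (an assumed similarity choice)."""
    n = len(points)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                A[i][j] = math.exp(-d2 / (2.0 * sigma ** 2))
    return A


def pic_embedding(A, max_iter=50, tol=1e-5):
    """Run a truncated power iteration on the row-normalized similarity matrix."""
    degrees = [sum(row) for row in A]
    W = [[a / d for a in row] for row, d in zip(A, degrees)]
    total = sum(degrees)
    v = [d / total for d in degrees]  # degree-based start vector (assumption)
    prev_delta = None
    for _ in range(max_iter):
        new_v = [sum(w * x for w, x in zip(row, v)) for row in W]
        norm = sum(abs(x) for x in new_v)
        new_v = [x / norm for x in new_v]
        delta = max(abs(a - b) for a, b in zip(new_v, v))
        v = new_v
        # Stop once the change per step has itself stopped changing.
        if prev_delta is not None and abs(prev_delta - delta) < tol:
            break
        prev_delta = delta
    return v


def split_by_gaps(values, k):
    """Cut the sorted 1-D embedding at its k-1 largest gaps (stand-in for k-means)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    positions = sorted(
        range(1, len(order)),
        key=lambda p: values[order[p]] - values[order[p - 1]],
        reverse=True,
    )
    cuts = set(positions[: k - 1])
    labels = [0] * len(values)
    current = 0
    for pos, idx in enumerate(order):
        if pos in cuts:
            current += 1
        labels[idx] = current
    return labels


def power_iteration_clustering(points, k, sigma=1.0):
    """Embed the points with a truncated power iteration, then split into k groups."""
    A = affinity_matrix(points, sigma)
    return split_by_gaps(pic_embedding(A), k)


if __name__ == "__main__":
    # Two well-separated groups of unequal size (made-up toy data).
    data = [(0, 0), (0, 1), (1, 0),
            (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
    print(power_iteration_clustering(data, k=2))
```

The sketch holds the full n-by-n similarity matrix in memory, which is exactly the limitation the abstract says p-PIC is meant to address.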
Pages: 352-359
Number of pages: 8
Related papers
50 in total
  • [21] Fast and effective Big Data exploration by clustering
    Ianni, Michele
    Masciari, Elio
    Mazzeo, Giuseppe M.
    Mezzanzanica, Mario
    Zaniolo, Carlo
    FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2020, 102 : 84 - 94
  • [22] Adaptive Power Iteration Clustering
    Liu, Bo
    Liu, Yong
    Zhang, Huiyan
    Xu, Yonghui
    Tang, Can
    Tang, Lianggui
    Qin, Huafeng
    Miao, Chunyan
    KNOWLEDGE-BASED SYSTEMS, 2021, 225
  • [23] Parallel Clustering of Big Data of Spatio-temporal Trajectory
    Hu, Chunchun
    Kang, Xionghua
    Luo, Nianxue
    Zhao, Qiansheng
    2015 11TH INTERNATIONAL CONFERENCE ON NATURAL COMPUTATION (ICNC), 2015, : 769 - 774
  • [24] Big data mining with parallel computing: A comparison of distributed and MapReduce methodologies
    Tsai, Chih-Fong
    Lin, Wei-Chao
    Ke, Shih-Wen
    JOURNAL OF SYSTEMS AND SOFTWARE, 2016, 122 : 83 - 92
  • [25] Strategies for Big Data Clustering
    Kurasova, Olga
    Marcinkevicius, Virginijus
    Medvedev, Viktor
    Rapecka, Aurimas
    Stefanovic, Pavel
    2014 IEEE 26TH INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE (ICTAI), 2014, : 740 - 747
  • [26] Big Data and Clustering Algorithms
    Ajin, V. W.
    Kumar, Lekshmy D.
    2016 INTERNATIONAL CONFERENCE ON RESEARCH ADVANCES IN INTEGRATED NAVIGATION SYSTEMS (RAINS), 2016,
  • [27] Big Data clustering validity
    Tlili, Monia
    Hamdani, Tarek M.
    2014 6TH INTERNATIONAL CONFERENCE OF SOFT COMPUTING AND PATTERN RECOGNITION (SOCPAR), 2014, : 348 - 352
  • [28] The application of parallel clustering analysis based on big data mining in physical community discovery
    Wu, Fan
    Zhou, Rui
    INTERNATIONAL JOURNAL OF SYSTEM ASSURANCE ENGINEERING AND MANAGEMENT, 2022, 13 (SUPPL 3) : 1054 - 1062
  • [29] A Modified Hybrid Fuzzy Clustering Method for Big Data
    Khoshkbarchi, Amir
    Kamali, Ali
    Amjadi, Mehdi
    Haeri, Maryam Amir
    2016 8TH INTERNATIONAL SYMPOSIUM ON TELECOMMUNICATIONS (IST), 2016, : 196 - 201
  • [30] Clustering Application for Streaming Big Data in Smart Grid
    Banga, Alisha
    Sinha, Amrita
    PROCEEDINGS OF THE 2018 IEEE INTERNATIONAL CONFERENCE ON COMMUNICATION AND SIGNAL PROCESSING (ICCSP), 2018, : 1051 - 1054