Iterative subsampling in solution path clustering of noisy big data

Cited by: 2
Authors
Marchetti, Yuliya [1]
Zhou, Qing [1]
Affiliations
[1] Univ Calif Los Angeles, Dept Stat, 8125 Math Sci Bldg, Los Angeles, CA 90095 USA
Funding
National Science Foundation (USA);
Keywords
Big data; Clustering; Sparse regularization; Subsampling; DISCRIMINANT-ANALYSIS; K-MEANS; IDENTIFICATION; ALGORITHM; NETWORK; OBJECTS;
DOI
10.4310/SII.2016.v9.n4.a2
Chinese Library Classification number
Q [Biological Sciences];
Discipline classification code
07 ; 0710 ; 09 ;
Abstract
We develop an iterative subsampling approach to improve the computational efficiency of our previous work on solution path clustering (SPC). The SPC method achieves clustering by concave regularization on the pairwise distances between cluster centers. This clustering method has the important capability to recognize noise and to provide a short path of clustering solutions; however, it is not sufficiently fast for big datasets. Thus, we propose a method that iterates between clustering a small subsample of the full data and sequentially assigning the other data points to attain orders of magnitude of computational savings. The new method preserves the ability to isolate noise, includes a solution selection mechanism that ultimately provides one clustering solution with an estimated number of clusters, and is shown to be able to extract small tight clusters from noisy data. The method's relatively minor losses in accuracy are demonstrated through simulation studies, and its ability to handle large datasets is illustrated through applications to gene expression datasets. An R package, SPClustering, for the SPC method with iterative subsampling is available at http://www.stat.ucla.edu/~zhou/Software.html.
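The loop described in the abstract (cluster a small subsample, then sequentially assign the remaining points, and repeat on what is still unassigned) can be illustrated with a minimal Python sketch. This is not the authors' algorithm and not the SPClustering API: the base clusterer here is a simple greedy leader clustering used as a stand-in for SPC's concave regularization, and every name and parameter (`iterative_subsample_clustering`, `cluster_subsample`, `sample_size`, `radius`, `max_iter`) is hypothetical. Points that never fall within `radius` of a center keep the label -1, which loosely mimics noise isolation.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def cluster_subsample(sample, radius):
    """Stand-in base clusterer (NOT SPC): greedy leader clustering.

    A point joins the first group whose current mean is within `radius`;
    otherwise it starts a new group. Singleton groups are treated as noise
    and dropped, so only centers of groups with >= 2 members are returned.
    """
    groups = []
    for p in sample:
        for g in groups:
            if dist(p, mean(g)) <= radius:
                g.append(p)
                break
        else:
            groups.append([p])
    return [mean(g) for g in groups if len(g) >= 2]


def iterative_subsample_clustering(data, sample_size, radius, max_iter=10, seed=0):
    """Alternate between clustering a subsample and assigning the rest.

    Returns (labels, centers); label -1 marks points left unassigned (noise).
    """
    rng = random.Random(seed)
    labels = [-1] * len(data)
    centers = []
    for _ in range(max_iter):
        unassigned = [i for i, lab in enumerate(labels) if lab == -1]
        if len(unassigned) < 2:
            break
        # Step 1: cluster a small subsample of the still-unassigned points.
        idx = rng.sample(unassigned, min(sample_size, len(unassigned)))
        new_centers = cluster_subsample([data[i] for i in idx], radius)
        if not new_centers:
            continue  # no cluster found in this draw; try another subsample
        centers.extend(new_centers)
        # Step 2: sequentially assign every unassigned point to its nearest
        # center, but only if it lies within `radius` of that center.
        for i in unassigned:
            d, k = min((dist(data[i], c), k) for k, c in enumerate(centers))
            if d <= radius:
                labels[i] = k
    return labels, centers
```

For example, calling `iterative_subsample_clustering(data, sample_size=10, radius=1.0)` on two tight groups of six points each plus two distant outliers would be expected to give each group its own label and leave the outliers at -1. The fixed `radius` is a simplification chosen for brevity; the abstract describes a solution path and a selection mechanism rather than a single fixed threshold.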
Pages: 415-431
Page count: 17
Related papers
50 in total
  • [41] Range-based Clustering Supporting Similarity Search in Big Data
    Trong Nhan Phan
    Jaeger, Markus
    Nadschlaeger, Stefan
    Kueng, Josef
    2015 26TH INTERNATIONAL WORKSHOP ON DATABASE AND EXPERT SYSTEMS APPLICATIONS (DEXA), 2015, : 120 - 124
  • [42] Using Parallel Hierarchical Clustering to Address Spatial Big Data Challenges
    Woodley, Alan
    Tang, Ling-Xiang
    Geva, Shlomo
    Nayak, Richi
    Chappell, Timothy
    2016 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2016, : 2692 - 2698
  • [43] Approximate Clustering Ensemble Method for Big Data
    Mahmud, Mohammad Sultan
    Huang, Joshua Zhexue
    Ruby, Rukhsana
    Ngueilbaye, Alladoumbaye
    Wu, Kaishun
    IEEE TRANSACTIONS ON BIG DATA, 2023, 9 (04) : 1142 - 1155
  • [44] A survey on parallel clustering algorithms for Big Data
    Zineb Dafir
    Yasmine Lamari
    Said Chah Slaoui
    Artificial Intelligence Review, 2021, 54 : 2411 - 2443
  • [45] p-PIC: Parallel power iteration clustering for big data
    Yan, Weizhong
    Brahmakshatriya, Umang
    Xue, Ya
    Gilder, Mark
    Wise, Bowden
    JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2013, 73 (03) : 352 - 359
  • [46] Batch Clustering Algorithm for Big Data Sets
    Alguliyev, Rasim
    Aliguliyev, Ramiz
    Bagirov, Adil
    Karimov, Rafael
    2016 IEEE 10TH INTERNATIONAL CONFERENCE ON APPLICATION OF INFORMATION AND COMMUNICATION TECHNOLOGIES (AICT), 2016, : 79 - 82
  • [47] Analysis of Mahout Big Data Clustering Algorithms
    Sharma, Ishan
    Tiwari, Rajeev
    Rana, Hukam Singh
    Anand, Abhineet
    INTELLIGENT COMMUNICATION, CONTROL AND DEVICES, ICICCD 2017, 2018, 624 : 999 - 1008
  • [48] External clustering validation in Big Data context
    Zerabi, Soumeya
    Meshoul, Souham
    PROCEEDINGS OF 2017 3RD INTERNATIONAL CONFERENCE OF CLOUD COMPUTING TECHNOLOGIES AND APPLICATIONS (CLOUDTECH), 2017, : 264 - 269
  • [49] Clustering on Big Data Using Hadoop MapReduce
    Akthar, Nadeem
    Ahamad, Mohd Vasim
    Khan, Shahbaz
    2015 INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE AND COMMUNICATION NETWORKS (CICN), 2015, : 789 - 795
  • [50] Big-Data Clustering with Genetic Algorithm
    Mortezanezhad, Afsaneh
    Daneshifar, Ebrahim
    2019 IEEE 5TH CONFERENCE ON KNOWLEDGE BASED ENGINEERING AND INNOVATION (KBEI 2019), 2019, : 702 - 706