Iterative subsampling in solution path clustering of noisy big data

Cited by: 2
Authors
Marchetti, Yuliya [1]
Zhou, Qing [1]
Affiliations
[1] Univ Calif Los Angeles, Dept Stat, 8125 Math Sci Bldg, Los Angeles, CA 90095 USA
Funding
National Science Foundation (USA)
Keywords
Big data; Clustering; Sparse regularization; Subsampling; DISCRIMINANT-ANALYSIS; K-MEANS; IDENTIFICATION; ALGORITHM; NETWORK; OBJECTS;
DOI
10.4310/SII.2016.v9.n4.a2
Chinese Library Classification number
Q [Biological sciences]
Discipline classification codes
07; 0710; 09
Abstract
We develop an iterative subsampling approach to improve the computational efficiency of our previous work on solution path clustering (SPC). The SPC method achieves clustering by concave regularization on the pairwise distances between cluster centers. This clustering method has the important capability to recognize noise and to provide a short path of clustering solutions; however, it is not sufficiently fast for big datasets. Thus, we propose a method that iterates between clustering a small subsample of the full data and sequentially assigning the other data points to attain orders of magnitude of computational savings. The new method preserves the ability to isolate noise, includes a solution selection mechanism that ultimately provides one clustering solution with an estimated number of clusters, and is shown to be able to extract small tight clusters from noisy data. The method's relatively minor losses in accuracy are demonstrated through simulation studies, and its ability to handle large datasets is illustrated through applications to gene expression datasets. An R package, SPClustering, for the SPC method with iterative subsampling is available at http://www.stat.ucla.edu/~zhou/Software.html.
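The loop described in the abstract (cluster a small subsample, then assign the remaining points, and repeat on whatever is left) can be illustrated with a rough Python sketch. This is not the SPC method and not the SPClustering package: the subsample step uses a simple distance-threshold "leader" clustering as a stand-in for SPC, and every name and parameter here (`leader_cluster`, `iterative_subsample_cluster`, `sample_size`, `radius`, `min_size`) is a hypothetical choice made for illustration only.

```python
import math
import random


def leader_cluster(points, radius):
    """Stand-in for the subsample clustering step (NOT the SPC method).

    A point joins the first leader within `radius`; otherwise it becomes
    a new leader. Returns the mean and size of each resulting group.
    """
    leaders, members = [], []
    for p in points:
        for k, c in enumerate(leaders):
            if math.dist(p, c) <= radius:
                members[k].append(p)
                break
        else:
            leaders.append(p)
            members.append([p])
    means = [tuple(sum(x) / len(m) for x in zip(*m)) for m in members]
    sizes = [len(m) for m in members]
    return means, sizes


def iterative_subsample_cluster(data, sample_size, radius,
                                min_size=3, max_iter=10, seed=0):
    """Alternate between clustering a subsample and assigning the rest.

    Points that never fall within `radius` of a center keep label -1,
    which this sketch treats as noise (an assumption of the sketch).
    """
    rng = random.Random(seed)
    labels = [-1] * len(data)
    centers = []
    for _ in range(max_iter):
        unassigned = [i for i, lab in enumerate(labels) if lab == -1]
        if len(unassigned) < min_size:
            break
        # Step 1: cluster only a small random subsample of unassigned points.
        sample = rng.sample(unassigned, min(sample_size, len(unassigned)))
        means, sizes = leader_cluster([data[i] for i in sample], radius)
        new_centers = [m for m, s in zip(means, sizes) if s >= min_size]
        if not new_centers:
            break
        centers.extend(new_centers)
        # Step 2: assign each unassigned point to its nearest center,
        # but only if it lies within `radius` of that center.
        for i in unassigned:
            best = min(range(len(centers)),
                       key=lambda k: math.dist(data[i], centers[k]))
            if math.dist(data[i], centers[best]) <= radius:
                labels[i] = best
    return labels, centers
```

Calling `iterative_subsample_cluster(data, sample_size=12, radius=1.5)` on a list of coordinate tuples returns one label per point and the list of cluster centers found. The savings in this sketch come from running the clustering step on `sample_size` points per iteration rather than on the full data; how the published method selects a solution and estimates the number of clusters is not reproduced here.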
Pages: 415-431
Number of pages: 17
Related papers
50 in total
  • [1] PARSUC: A Parallel Subsampling-Based Method for Clustering Remote Sensing Big Data
    Xia, Huiyu
    Huang, Wei
    Li, Ning
    Zhou, Jianzhong
    Zhang, Dongying
    SENSORS, 2019, 19 (15)
  • [2] Iterative big data clustering algorithms: a review
    Mohebi, Amin
    Aghabozorgi, Saeed
    Teh Ying Wah
    Herawan, Tutut
    Yahyapour, Ramin
    SOFTWARE-PRACTICE & EXPERIENCE, 2016, 46 (01) : 107 - 129
  • [3] A Data Science and Engineering Solution for Fast k-Means Clustering of Big Data
    Dierckens, Karl E.
    Harrison, Adrian B.
    Leung, Carson K.
    Pind, Adrienne V.
    2017 16TH IEEE INTERNATIONAL CONFERENCE ON TRUST, SECURITY AND PRIVACY IN COMPUTING AND COMMUNICATIONS / 11TH IEEE INTERNATIONAL CONFERENCE ON BIG DATA SCIENCE AND ENGINEERING / 14TH IEEE INTERNATIONAL CONFERENCE ON EMBEDDED SOFTWARE AND SYSTEMS, 2017, : 925 - 932
  • [4] Iterative Unified Clustering in Big Data
    Misal, Vasundhara
    Janeja, Vandana P.
    Pallaprolu, Sai C.
    Yesha, Yelena
    Chintalapati, Raghu
    2016 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2016, : 3412 - 3421
  • [5] A Novel Clustering Technique for Efficient Clustering of Big Data in Hadoop Ecosystem
    Kumar, Sunil
    Singh, Maninder
    BIG DATA MINING AND ANALYTICS, 2019, 2 (04): : 240 - 247
  • [6] Big Data and Clustering Algorithms
    Ajin, V. W.
    Kumar, Lekshmy D.
    2016 INTERNATIONAL CONFERENCE ON RESEARCH ADVANCES IN INTEGRATED NAVIGATION SYSTEMS (RAINS), 2016,
  • [7] Big Data Clustering: A Review
    Shirkhorshidi, Ali Seyed
    Aghabozorgi, Saeed
    Teh, Ying Wah
    Herawan, Tutut
    COMPUTATIONAL SCIENCE AND ITS APPLICATIONS - ICCSA 2014, PT V, 2014, 8583 : 707 - 720
  • [8] MapReduce Clustering for Big Data
    Ghattas, Badih
    Pinto, Antoine
    Diao, Sambou
    2021 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2021, : 5116 - 5124
  • [9] Implementing Clustering and Classification Approaches for Big Data with MATLAB
    Pitz, Katrin
    Anderl, Reiner
    PROCEEDINGS OF THE FUTURE TECHNOLOGIES CONFERENCE (FTC) 2018, VOL 1, 2019, 880 : 458 - 480
  • [10] Scaling by subsampling for big data, with applications to statistical learning
    Bertail, Patrice
    Bouchouia, Mohammed
    Jelassi, Ons
    Tressou, Jessica
    Zetlaoui, Melanie
    JOURNAL OF NONPARAMETRIC STATISTICS, 2024, 36 (01) : 78 - 117