Iterative subsampling in solution path clustering of noisy big data

Cited by: 2
Authors
Marchetti, Yuliya [1]
Zhou, Qing [1]
Affiliations
[1] Univ Calif Los Angeles, Dept Stat, 8125 Math Sci Bldg, Los Angeles, CA 90095 USA
Funding
National Science Foundation (USA);
Keywords
Big data; Clustering; Sparse regularization; Subsampling; DISCRIMINANT-ANALYSIS; K-MEANS; IDENTIFICATION; ALGORITHM; NETWORK; OBJECTS;
DOI
10.4310/SII.2016.v9.n4.a2
Chinese Library Classification number
Q [Biological sciences];
Discipline classification code
07 ; 0710 ; 09 ;
Abstract
We develop an iterative subsampling approach to improve the computational efficiency of our previous work on solution path clustering (SPC). The SPC method achieves clustering by concave regularization on the pairwise distances between cluster centers. This clustering method has the important capability to recognize noise and to provide a short path of clustering solutions; however, it is not sufficiently fast for big datasets. Thus, we propose a method that iterates between clustering a small subsample of the full data and sequentially assigning the other data points to attain orders of magnitude of computational savings. The new method preserves the ability to isolate noise, includes a solution selection mechanism that ultimately provides one clustering solution with an estimated number of clusters, and is shown to be able to extract small tight clusters from noisy data. The method's relatively minor losses in accuracy are demonstrated through simulation studies, and its ability to handle large datasets is illustrated through applications to gene expression datasets. An R package, SPClustering, for the SPC method with iterative subsampling is available at http://www.stat.ucla.edu/~zhou/Software.html.
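The alternation described in the abstract (cluster a small subsample, then sequentially assign the remaining points, and repeat) can be illustrated with a minimal Python sketch. This is not the authors' SPC method and not the SPClustering package: the clustering step uses a plain greedy leader clustering as a stand-in for SPC, and the function names and the parameters `sample_size`, `radius`, `min_size`, and `max_iter` are hypothetical choices made only for this illustration. Points that never fall within `radius` of a retained center keep the label -1, which loosely mimics the idea of isolating noise.

```python
import math
import random


def leader_cluster(points, radius):
    """Stand-in clusterer (NOT SPC): greedy leader clustering.

    Each point joins the first cluster whose running mean lies within
    `radius`; otherwise it starts a new cluster.
    """
    centers, members = [], []
    for p in points:
        for k, c in enumerate(centers):
            if math.dist(p, c) <= radius:
                members[k].append(p)
                n = len(members[k])
                # Update the running mean of cluster k.
                centers[k] = tuple(ci + (pi - ci) / n for ci, pi in zip(c, p))
                break
        else:
            centers.append(tuple(p))
            members.append([p])
    return centers, members


def iterative_subsample_cluster(data, sample_size, radius, min_size=3,
                                max_iter=10, seed=0):
    """Return (labels, centers); label -1 marks points left as noise."""
    rng = random.Random(seed)
    labels = [-1] * len(data)
    centers = []
    unassigned = list(range(len(data)))
    for _ in range(max_iter):
        if len(unassigned) < min_size:
            break
        # Step 1: cluster a small subsample of the still-unassigned points.
        sample = rng.sample(unassigned, min(sample_size, len(unassigned)))
        new_centers, members = leader_cluster([data[i] for i in sample], radius)
        # Keep only clusters with enough support in the subsample.
        kept = [c for c, m in zip(new_centers, members) if len(m) >= min_size]
        if not kept:
            continue
        offset = len(centers)
        centers.extend(kept)
        # Step 2: sequentially assign each unassigned point to the nearest
        # newly found center, if it is close enough.
        still = []
        for i in unassigned:
            d, k = min((math.dist(data[i], c), k) for k, c in enumerate(kept))
            if d <= radius:
                labels[i] = offset + k
            else:
                still.append(i)
        unassigned = still
    return labels, centers
```

Calling `iterative_subsample_cluster(data, sample_size=30, radius=1.0)` on a list of coordinate tuples returns one label per point plus the retained centers. Only the subsample is ever clustered, so the cost of the clustering step depends on `sample_size` rather than on the full data size, which is the source of the savings the abstract refers to.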
Pages: 415 - 431
Page count: 17
Related papers
50 in total
  • [31] Solution path clustering with adaptive concave penalty
    Marchetti, Yuliya
    Zhou, Qing
    ELECTRONIC JOURNAL OF STATISTICS, 2014, 8 : 1569 - 1603
  • [32] How to Use K-means for Big Data Clustering?
    Mussabayev, Rustam
    Mladenovic, Nenad
    Jarboui, Bassem
    Mussabayev, Ravil
    PATTERN RECOGNITION, 2023, 137
  • [33] DENCLUE-IM: A New Approach for Big Data Clustering
    Rehioui, Hajar
    Idrissi, Abdellah
    Abourezq, Manar
    Zegrari, Faouzia
    7TH INTERNATIONAL CONFERENCE ON AMBIENT SYSTEMS, NETWORKS AND TECHNOLOGIES (ANT 2016) / THE 6TH INTERNATIONAL CONFERENCE ON SUSTAINABLE ENERGY INFORMATION TECHNOLOGY (SEIT-2016) / AFFILIATED WORKSHOPS, 2016, 83 : 560 - 567
  • [34] New Approach for Clustering of Big Data: DisK-Means
    Saini, Anu
    Minocha, Jagrit
    Ubriani, Jaypriya
    Sharma, Dhruv
    2016 IEEE INTERNATIONAL CONFERENCE ON COMPUTING, COMMUNICATION AND AUTOMATION (ICCCA), 2016, : 122 - 126
  • [35] HdK-Means: Hadoop Based Parallel K-Means Clustering for Big Data
    Bandyopadhyay, Soumyendu Sekhar
    Halder, Anup Kumar
    Chatterjee, Piyali
    Nasipuri, Mita
    Basu, Subhadip
    2017 IEEE CALCUTTA CONFERENCE (CALCON), 2017, : 452 - 456
  • [36] Research on complex attribute big data classification based on iterative fuzzy clustering algorithm
    Qian, Li
    WEB INTELLIGENCE, 2021, 19 (1-2) : 147 - 158
  • [37] Iterative Path Clustering for Software Fault Localization
    Chen, Rong
    Chen, Shifeng
    Zhang, Nan
    2016 IEEE INTERNATIONAL CONFERENCE ON SOFTWARE QUALITY, RELIABILITY AND SECURITY COMPANION (QRS-C 2016), 2016, : 292 - 297
  • [38] Optimized Ensembles for Clustering Noisy Data
    Breaban, Mihaela Elena
    LEARNING AND INTELLIGENT OPTIMIZATION, 2010, 6073 : 220 - 223
  • [39] Competitive algorithms for the clustering of noisy data
    Yang, TN
    Wang, SD
    FUZZY SETS AND SYSTEMS, 2004, 141 (02) : 281 - 299
  • [40] K-MEANS+: A DEVELOPED CLUSTERING ALGORITHM FOR BIG DATA
    Niu, Kun
    Gao, Zhipeng
    Jiao, Haizhen
    Deng, Nanjie
    PROCEEDINGS OF 2016 4TH IEEE INTERNATIONAL CONFERENCE ON CLOUD COMPUTING AND INTELLIGENCE SYSTEMS (IEEE CCIS 2016), 2016, : 141 - 144