Iterative subsampling in solution path clustering of noisy big data

Cited by: 2

Authors:

Marchetti, Yuliya [1]

Zhou, Qing [1]

Affiliations:

[1] Univ Calif Los Angeles, Dept Stat, 8125 Math Sci Bldg, Los Angeles, CA 90095 USA

Source:

STATISTICS AND ITS INTERFACE | 2016 / Vol. 9 / Issue 04

Funding:

U.S. National Science Foundation;

Keywords:

Big data; Clustering; Sparse regularization; Subsampling; DISCRIMINANT-ANALYSIS; K-MEANS; IDENTIFICATION; ALGORITHM; NETWORK; OBJECTS;

DOI:

10.4310/SII.2016.v9.n4.a2

Chinese Library Classification (CLC) number:

Q [Biological sciences];

Discipline classification codes:

07 ; 0710 ; 09 ;

Abstract:

We develop an iterative subsampling approach to improve the computational efficiency of our previous work on solution path clustering (SPC). The SPC method achieves clustering by concave regularization on the pairwise distances between cluster centers. This clustering method has the important capability to recognize noise and to provide a short path of clustering solutions; however, it is not sufficiently fast for big datasets. Thus, we propose a method that iterates between clustering a small subsample of the full data and sequentially assigning the other data points to attain orders of magnitude of computational savings. The new method preserves the ability to isolate noise, includes a solution selection mechanism that ultimately provides one clustering solution with an estimated number of clusters, and is shown to be able to extract small tight clusters from noisy data. The method's relatively minor losses in accuracy are demonstrated through simulation studies, and its ability to handle large datasets is illustrated through applications to gene expression datasets. An R package, SPClustering, for the SPC method with iterative subsampling is available at http://www.stat.ucla.edu/~zhou/Software.html.
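
The alternation the abstract describes (cluster a small subsample, then sequentially assign the remaining points, and repeat on whatever is still unassigned) can be illustrated with a rough sketch. This is not the authors' algorithm and not the API of the SPClustering package: the SPC step is replaced here by a simple greedy leader clustering as a stand-in, and every name and parameter (`leader_clusters`, `iterative_subsample_cluster`, `radius`, `min_size`, `sample_size`) is a hypothetical choice made only for illustration.

```python
import math
import random


def leader_clusters(points, radius, min_size):
    """Stand-in base clusterer (NOT the SPC method): greedy leader clustering.

    Returns the mean of every group with at least `min_size` members;
    smaller groups are treated as noise and dropped.
    """
    leaders, groups = [], []
    for p in points:
        for leader, group in zip(leaders, groups):
            if math.dist(p, leader) <= radius:
                group.append(p)
                break
        else:
            # No existing leader is close enough, so p starts a new group.
            leaders.append(p)
            groups.append([p])
    return [
        tuple(sum(coord) / len(g) for coord in zip(*g))
        for g in groups
        if len(g) >= min_size
    ]


def iterative_subsample_cluster(data, sample_size, radius, min_size=3,
                                max_iter=50, seed=0):
    """Alternate between clustering a subsample and assigning the rest."""
    rng = random.Random(seed)
    labels = [-1] * len(data)            # -1 marks points left as noise
    centers = []
    remaining = list(range(len(data)))   # indices not yet assigned
    for _ in range(max_iter):
        if len(remaining) < min_size:
            break
        sample = rng.sample(remaining, min(sample_size, len(remaining)))
        new_centers = leader_clusters([data[i] for i in sample],
                                      radius, min_size)
        if not new_centers:
            if len(sample) == len(remaining):
                break                    # nothing left that forms a cluster
            continue                     # unlucky draw: resample
        centers.extend(new_centers)
        # Sequentially assign each unlabeled point to its nearest center,
        # but only if it lies within `radius`; otherwise it stays unlabeled.
        still_unassigned = []
        for i in remaining:
            k = min(range(len(centers)),
                    key=lambda j: math.dist(data[i], centers[j]))
            if math.dist(data[i], centers[k]) <= radius:
                labels[i] = k
            else:
                still_unassigned.append(i)
        remaining = still_unassigned
    return labels, centers


# Toy data: two tight clusters plus three isolated noise points.
cluster_a = [(0.01 * i, 0.0) for i in range(20)]
cluster_b = [(5.0 + 0.01 * i, 5.0) for i in range(20)]
noise = [(100.0, 100.0), (-50.0, 30.0), (20.0, -70.0)]
data = cluster_a + cluster_b + noise

labels, centers = iterative_subsample_cluster(data, sample_size=15, radius=1.0)
print(len(centers), labels[-3:])
```

In this toy setting only the subsample is ever clustered directly; every other point costs one nearest-center lookup, which is where the savings in such a scheme would come from. Points that never fall within `radius` of any center keep the label -1, a crude analogue of isolating noise. The solution selection mechanism mentioned in the abstract is not modelled here.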

Pages: 415-431

Page count: 17
