Iterative subsampling in solution path clustering of noisy big data

Cited by: 2

Authors:

Marchetti, Yuliya [1]

Zhou, Qing [1]

Affiliations:

[1] Univ Calif Los Angeles, Dept Stat, 8125 Math Sci Bldg, Los Angeles, CA 90095 USA

Source:

STATISTICS AND ITS INTERFACE | 2016 / Vol. 9 / No. 4

Funding:

US National Science Foundation;

Keywords:

Big data; Clustering; Sparse regularization; Subsampling; DISCRIMINANT-ANALYSIS; K-MEANS; IDENTIFICATION; ALGORITHM; NETWORK; OBJECTS;

DOI:

10.4310/SII.2016.v9.n4.a2

Chinese Library Classification (CLC) number:

Q [Biological Sciences];

Discipline classification code:

07 ; 0710 ; 09 ;

Abstract:

We develop an iterative subsampling approach to improve the computational efficiency of our previous work on solution path clustering (SPC). The SPC method achieves clustering by concave regularization on the pairwise distances between cluster centers. This clustering method has the important capability to recognize noise and to provide a short path of clustering solutions; however, it is not sufficiently fast for big datasets. Thus, we propose a method that iterates between clustering a small subsample of the full data and sequentially assigning the other data points to attain orders of magnitude of computational savings. The new method preserves the ability to isolate noise, includes a solution selection mechanism that ultimately provides one clustering solution with an estimated number of clusters, and is shown to be able to extract small tight clusters from noisy data. The method's relatively minor losses in accuracy are demonstrated through simulation studies, and its ability to handle large datasets is illustrated through applications to gene expression datasets. An R package, SPClustering, for the SPC method with iterative subsampling is available at http://www.stat.ucla.edu/~zhou/Software.html.
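The loop described in the abstract (cluster a small subsample, then sequentially assign the remaining points, and repeat) can be illustrated with a minimal Python sketch. This is not the authors' SPC algorithm and not the SPClustering package: the subsample clusterer here is a simple fixed-radius "leader" clusterer swapped in as a stand-in, and every name and parameter (`leader_cluster`, `iterative_subsample_cluster`, `radius`, `min_size`, `n_rounds`) is a hypothetical choice made for illustration only.

```python
import numpy as np


def leader_cluster(points, radius):
    """Stand-in clusterer (NOT SPC): greedy leader clustering with a fixed radius."""
    centers, counts = [], []
    for p in points:
        if centers:
            d = np.linalg.norm(np.asarray(centers) - p, axis=1)
            j = int(np.argmin(d))
            if d[j] <= radius:
                counts[j] += 1
                centers[j] += (p - centers[j]) / counts[j]  # running mean update
                continue
        centers.append(p.astype(float))  # start a new group at this point
        counts.append(1)
    return np.asarray(centers), np.asarray(counts)


def iterative_subsample_cluster(X, subsample_size=50, radius=1.0,
                                min_size=3, n_rounds=5, seed=0):
    """Alternate between clustering a subsample and assigning the other points.

    Points that never fall within `radius` of a retained center keep the
    label -1, which this sketch treats as noise (an assumed convention).
    """
    rng = np.random.default_rng(seed)
    labels = np.full(len(X), -1)
    centers = np.empty((0, X.shape[1]))
    for _ in range(n_rounds):
        unassigned = np.flatnonzero(labels == -1)
        if len(unassigned) < min_size:
            break
        size = min(subsample_size, len(unassigned))
        sub = rng.choice(unassigned, size=size, replace=False)
        new_centers, counts = leader_cluster(X[sub], radius)
        new_centers = new_centers[counts >= min_size]  # tiny groups: likely noise
        if len(new_centers) == 0:
            continue
        centers = np.vstack([centers, new_centers])
        # Sequentially assign every still-unassigned point to its nearest center.
        for i in unassigned:
            d = np.linalg.norm(centers - X[i], axis=1)
            j = int(np.argmin(d))
            if d[j] <= radius:
                labels[i] = j
    return labels, centers
```

Only the subsample passes through the clustering step, so the expensive part of each round scales with `subsample_size` rather than with the full dataset; the assignment pass is a nearest-center lookup per point. The fixed `radius` and `min_size` thresholds are placeholders for the noise handling and solution selection that the abstract attributes to the actual method.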


Pages: 415 - 431

Page count: 17

50 items in total

[31] Solution path clustering with adaptive concave penalty
Marchetti, Yuliya
Zhou, Qing
ELECTRONIC JOURNAL OF STATISTICS, 2014, 8 : 1569 - 1603
[32] How to Use K-means for Big Data Clustering?
Mussabayev, Rustam
Mladenovic, Nenad
Jarboui, Bassem
Mussabayev, Ravil
PATTERN RECOGNITION, 2023, 137
[33] DENCLUE-IM: A New Approach for Big Data Clustering
Rehioui, Hajar
Idrissi, Abdellah
Abourezq, Manar
Zegrari, Faouzia
7TH INTERNATIONAL CONFERENCE ON AMBIENT SYSTEMS, NETWORKS AND TECHNOLOGIES (ANT 2016) / THE 6TH INTERNATIONAL CONFERENCE ON SUSTAINABLE ENERGY INFORMATION TECHNOLOGY (SEIT-2016) / AFFILIATED WORKSHOPS, 2016, 83 : 560 - 567
[34] New Approach for Clustering of Big Data: DisK-Means
Saini, Anu
Minocha, Jagrit
Ubriani, Jaypriya
Sharma, Dhruv
2016 IEEE INTERNATIONAL CONFERENCE ON COMPUTING, COMMUNICATION AND AUTOMATION (ICCCA), 2016, : 122 - 126
[35] HdK-Means: Hadoop Based Parallel K-Means Clustering for Big Data
Bandyopadhyay, Soumyendu Sekhar
Halder, Anup Kumar
Chatterjee, Piyali
Nasipuri, Mita
Basu, Subhadip
2017 IEEE CALCUTTA CONFERENCE (CALCON), 2017, : 452 - 456
[36] Research on complex attribute big data classification based on iterative fuzzy clustering algorithm
Qian, Li
WEB INTELLIGENCE, 2021, 19 (1-2) : 147 - 158
[37] Iterative Path Clustering for Software Fault Localization
Chen, Rong
Chen, Shifeng
Zhang, Nan
2016 IEEE INTERNATIONAL CONFERENCE ON SOFTWARE QUALITY, RELIABILITY AND SECURITY COMPANION (QRS-C 2016), 2016, : 292 - 297
[38] Optimized Ensembles for Clustering Noisy Data
Breaban, Mihaela Elena
LEARNING AND INTELLIGENT OPTIMIZATION, 2010, 6073 : 220 - 223
[39] Competitive algorithms for the clustering of noisy data
Yang, TN
Wang, SD
FUZZY SETS AND SYSTEMS, 2004, 141 (02) : 281 - 299
[40] K-MEANS+: A DEVELOPED CLUSTERING ALGORITHM FOR BIG DATA
Niu, Kun
Gao, Zhipeng
Jiao, Haizhen
Deng, Nanjie
PROCEEDINGS OF 2016 4TH IEEE INTERNATIONAL CONFERENCE ON CLOUD COMPUTING AND INTELLIGENCE SYSTEMS (IEEE CCIS 2016), 2016, : 141 - 144
