Iterative subsampling in solution path clustering of noisy big data

Cited by: 2

Authors:

Marchetti, Yuliya [1]

Zhou, Qing [1]

Affiliations:

[1] Univ Calif Los Angeles, Dept Stat, 8125 Math Sci Bldg, Los Angeles, CA 90095 USA

Source:

STATISTICS AND ITS INTERFACE | 2016 / Vol. 9 / Issue 04

Funding:

U.S. National Science Foundation;

Keywords:

Big data; Clustering; Sparse regularization; Subsampling; DISCRIMINANT-ANALYSIS; K-MEANS; IDENTIFICATION; ALGORITHM; NETWORK; OBJECTS;

DOI:

10.4310/SII.2016.v9.n4.a2

Chinese Library Classification number:

Q [Biological Sciences];

Discipline classification codes:

07 ; 0710 ; 09 ;

Abstract:

We develop an iterative subsampling approach to improve the computational efficiency of our previous work on solution path clustering (SPC). The SPC method achieves clustering by concave regularization on the pairwise distances between cluster centers. This clustering method has the important capability to recognize noise and to provide a short path of clustering solutions; however, it is not sufficiently fast for big datasets. Thus, we propose a method that iterates between clustering a small subsample of the full data and sequentially assigning the other data points to attain orders of magnitude of computational savings. The new method preserves the ability to isolate noise, includes a solution selection mechanism that ultimately provides one clustering solution with an estimated number of clusters, and is shown to be able to extract small tight clusters from noisy data. The method's relatively minor losses in accuracy are demonstrated through simulation studies, and its ability to handle large datasets is illustrated through applications to gene expression datasets. An R package, SPClustering, for the SPC method with iterative subsampling is available at http://www.stat.ucla.edu/~zhou/Software.html.
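
The loop the abstract describes (cluster a small subsample, assign the remaining points, repeat on whatever is left) can be illustrated with a minimal Python sketch. This is not the authors' SPC algorithm and not the SPClustering package: the subsample step uses a simple greedy "leader" clusterer as a stand-in for SPC, points are assigned in one batch rather than sequentially, and every name and parameter here (cluster_subsample, iterative_subsample_clustering, radius, min_members, sample_size) is a hypothetical choice made for illustration only.

```python
import numpy as np


def cluster_subsample(points, radius, min_members):
    """Stand-in for SPC (assumption): greedy leader clustering on a subsample.

    A point becomes a new leader if it is farther than `radius` from every
    existing leader. Leaders with fewer than `min_members` subsample points
    within `radius` are treated as noise and dropped.
    """
    leaders = []
    for p in points:
        if all(np.linalg.norm(p - q) > radius for q in leaders):
            leaders.append(p)
    centers = []
    for q in leaders:
        members = points[np.linalg.norm(points - q, axis=1) <= radius]
        if len(members) >= min_members:
            centers.append(members.mean(axis=0))
    # Always return a 2-D array, even when no center survives.
    return np.array(centers).reshape(-1, points.shape[1])


def iterative_subsample_clustering(X, sample_size, radius,
                                   min_members=3, max_iter=10, seed=0):
    """Alternate between clustering a subsample and assigning the rest.

    Returns one integer label per row of X; -1 marks points left as noise.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    labels = np.full(len(X), -1)
    next_label = 0
    for _ in range(max_iter):
        remaining = np.flatnonzero(labels == -1)
        if len(remaining) < min_members:
            break
        # Step 1: cluster a small random subsample of the unassigned points.
        size = min(sample_size, len(remaining))
        sample = rng.choice(remaining, size=size, replace=False)
        centers = cluster_subsample(X[sample], radius, min_members)
        if len(centers) == 0:
            break  # nothing but noise found in this subsample
        # Step 2: assign every unassigned point that is close to a center.
        diffs = X[remaining][:, None, :] - centers[None, :, :]
        dists = np.linalg.norm(diffs, axis=2)
        nearest = dists.argmin(axis=1)
        close = dists.min(axis=1) <= radius
        labels[remaining[close]] = next_label + nearest[close]
        next_label += len(centers)
    return labels
```

Calling iterative_subsample_clustering(X, sample_size=50, radius=1.0) on an (n, d) array returns a label per row. The expensive clustering step only ever sees sample_size points per iteration, which is the source of the computational savings the abstract refers to; points that never fall near a surviving center keep the label -1, mirroring the idea of isolating noise.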


Pages: 415-431

Number of pages: 17

Related records: 50 in total

[41] Range-based Clustering Supporting Similarity Search in Big Data
Trong Nhan Phan
Jaeger, Markus
Nadschlaeger, Stefan
Kueng, Josef
2015 26TH INTERNATIONAL WORKSHOP ON DATABASE AND EXPERT SYSTEMS APPLICATIONS (DEXA), 2015, : 120 - 124
[42] Using Parallel Hierarchical Clustering to Address Spatial Big Data Challenges
Woodley, Alan
Tang, Ling-Xiang
Geva, Shlomo
Nayak, Richi
Chappell, Timothy
2016 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2016, : 2692 - 2698
[43] Approximate Clustering Ensemble Method for Big Data
Mahmud, Mohammad Sultan
Huang, Joshua Zhexue
Ruby, Rukhsana
Ngueilbaye, Alladoumbaye
Wu, Kaishun
IEEE TRANSACTIONS ON BIG DATA, 2023, 9 (04) : 1142 - 1155
[44] A survey on parallel clustering algorithms for Big Data
Zineb Dafir
Yasmine Lamari
Said Chah Slaoui
Artificial Intelligence Review, 2021, 54 : 2411 - 2443
[45] p-PIC: Parallel power iteration clustering for big data
Yan, Weizhong
Brahmakshatriya, Umang
Xue, Ya
Gilder, Mark
Wise, Bowden
JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2013, 73 (03) : 352 - 359
[46] Batch Clustering Algorithm for Big Data Sets
Alguliyev, Rasim
Aliguliyev, Ramiz
Bagirov, Adil
Karimov, Rafael
2016 IEEE 10TH INTERNATIONAL CONFERENCE ON APPLICATION OF INFORMATION AND COMMUNICATION TECHNOLOGIES (AICT), 2016, : 79 - 82
[47] Analysis of Mahout Big Data Clustering Algorithms
Sharma, Ishan
Tiwari, Rajeev
Rana, Hukam Singh
Anand, Abhineet
INTELLIGENT COMMUNICATION, CONTROL AND DEVICES, ICICCD 2017, 2018, 624 : 999 - 1008
[48] External clustering validation in Big Data context
Zerabi, Soumeya
Meshoul, Souham
PROCEEDINGS OF 2017 3RD INTERNATIONAL CONFERENCE OF CLOUD COMPUTING TECHNOLOGIES AND APPLICATIONS (CLOUDTECH), 2017, : 264 - 269
[49] Clustering on Big Data Using Hadoop MapReduce
Akthar, Nadeem
Ahamad, Mohd Vasim
Khan, Shahbaz
2015 INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE AND COMMUNICATION NETWORKS (CICN), 2015, : 789 - 795
[50] Big-Data Clustering with Genetic Algorithm
Mortezanezhad, Afsaneh
Daneshifar, Ebrahim
2019 IEEE 5TH CONFERENCE ON KNOWLEDGE BASED ENGINEERING AND INNOVATION (KBEI 2019), 2019, : 702 - 706
