Ultra-Scalable Spectral Clustering and Ensemble Clustering

被引:357
作者
Huang, Dong [1 ,2 ]
Wang, Chang-Dong [3 ,4 ,5 ]
Wu, Jian-Sheng [6 ]
Lai, Jian-Huang [3 ,4 ,5 ]
Kwoh, Chee-Keong [2 ]
机构
[1] South China Agr Univ, Coll Math & Informat, Guangzhou 510640, Peoples R China
[2] Nanyang Technol Univ, Sch Comp Sci & Engn, Singapore 639798, Singapore
[3] Sun Yat Sen Univ, Sch Data & Comp Sci, Guangzhou 510006, Peoples R China
[4] Guangdong Key Lab Informat Secur Technol, Guangzhou 510006, Peoples R China
[5] Minist Educ, Key Lab Machine Intelligence & Adv Comp, Guangzhou, Peoples R China
[6] Nanchang Univ, Sch Informat Engn, Nanchang 330031, Jiangxi, Peoples R China
关键词
Clustering algorithms; Sparse matrices; Complexity theory; Robustness; Bipartite graph; Scalability; Approximation algorithms; Data clustering; large-scale clustering; spectral clustering; ensemble clustering; large-scale datasets; nonlinearly separable datasets; COMBINING MULTIPLE CLUSTERINGS;
D O I
10.1109/TKDE.2019.2903410
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
This paper focuses on scalability and robustness of spectral clustering for extremely large-scale datasets with limited resources. Two novel algorithms are proposed, namely, ultra-scalable spectral clustering (U-SPEC) and ultra-scalable ensemble clustering (U-SENC). In U-SPEC, a hybrid representative selection strategy and a fast approximation method for $K$K-nearest representatives are proposed for the construction of a sparse affinity sub-matrix. By interpreting the sparse sub-matrix as a bipartite graph, the transfer cut is then utilized to efficiently partition the graph and obtain the clustering result. In U-SENC, multiple U-SPEC clusterers are further integrated into an ensemble clustering framework to enhance the robustness of U-SPEC while maintaining high efficiency. Based on the ensemble generation via multiple U-SEPC's, a new bipartite graph is constructed between objects and base clusters and then efficiently partitioned to achieve the consensus clustering result. It is noteworthy that both U-SPEC and U-SENC have nearly linear time and space complexity, and are capable of robustly and efficiently partitioning 10-million-level nonlinearly-separable datasets on a PC with 64 GB memory. Experiments on various large-scale datasets have demonstrated the scalability and robustness of our algorithms. The MATLAB code and experimental data are available at https://www.researchgate.net/publication/330760669.
引用
收藏
页码:1212 / 1226
页数:15
相关论文
共 35 条
[1]  
[Anonymous], Matrix Computations
[2]   RNN-DBSCAN: A Density-Based Clustering Algorithm Using Reverse Nearest Neighbor Density Estimates [J].
Bryant, Avory ;
Cios, Krzysztof .
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2018, 30 (06) :1109-1121
[3]  
Cai D., 2011, Litekmeans: the fastest matlab implementation of kmeans, 311
[4]   Large Scale Spectral Clustering Via Landmark-Based Sparse Representation [J].
Cai, Deng ;
Chen, Xinlei .
IEEE TRANSACTIONS ON CYBERNETICS, 2015, 45 (08) :1669-1680
[5]   Parallel Spectral Clustering in Distributed Systems [J].
Chen, Wen-Yen ;
Song, Yangqiu ;
Bai, Hongjie ;
Lin, Chih-Jen ;
Chang, Edward Y. .
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2011, 33 (03) :568-586
[6]  
Dong W., 2011, P 20 INT C WORLD WID, P577
[7]  
Fern X.Z., 2004, P 21 INT C MACH LEAR, P36
[8]   Ensemble clustering by means of clustering embedding in vector spaces [J].
Franek, Lucas ;
Jiang, Xiaoyi .
PATTERN RECOGNITION, 2014, 47 (02) :833-842
[9]   Combining multiple clusterings using evidence accumulation [J].
Fred, ALN ;
Jain, AK .
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2005, 27 (06) :835-850
[10]   Fast Large-Scale Spectral Clustering via Explicit Feature Mapping [J].
He, Li ;
Ray, Nilanjan ;
Guan, Yisheng ;
Zhang, Hong .
IEEE TRANSACTIONS ON CYBERNETICS, 2019, 49 (03) :1058-1071