Clustering Large Datasets by MergingK-Means Solutions

被引:16
作者
Melnykov, Volodymyr [1 ]
Michael, Semhar [2 ]
机构
[1] Univ Alabama, Dept Informat Syst Stat & Management Sci, Tuscaloosa, AL 35487 USA
[2] South Dakota State Univ, Dept Math & Stat, Brookings, SD 57007 USA
关键词
K-means; Finite mixture models; Merging components; Pairwise overlap; Classification EM algorithm; FINITE MIXTURE; COMPONENTS; ALGORITHM;
D O I
10.1007/s00357-019-09314-8
中图分类号
O1 [数学];
学科分类号
0701 ; 070101 ;
摘要
Existing clustering methods range from simple but very restrictive to complex but more flexible. TheK-means algorithm is one of the most popular clustering procedures due to its computational speed and intuitive construction. Unfortunately, the application ofK-means in its traditional form based on Euclidean distances is limited to cases with spherical clusters of approximately the same volume and spread of points. Recent developments in the area of merging mixture components for clustering show good promise. We propose a general framework for hierarchical merging based on pairwise overlap between components which can be readily applied in the context of theK-means algorithm to produce meaningful clusters. Such an approach preserves the main advantage of theK-means algorithm-its speed. The developed ideas are illustrated on examples, studied through simulations, and applied to the problem of digit recognition.
引用
收藏
页码:97 / 123
页数:27
相关论文
共 36 条
[1]   A clustering algorithm for multivariate data streams with correlated components [J].
Aletti G. ;
Micheletti A. .
Journal of Big Data, 4 (1)
[2]  
Alimoglu F., 1996, P 5 TURK ART INT ART
[3]   Combining Mixture Components for Clustering [J].
Baudry, Jean-Patrick ;
Raftery, Adrian E. ;
Celeux, Gilles ;
Lo, Kenneth ;
Gottardo, Raphael .
JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS, 2010, 19 (02) :332-353
[4]   Model-based clustering of high-dimensional data: A review [J].
Bouveyron, Charles ;
Brunet-Saumard, Camille .
COMPUTATIONAL STATISTICS & DATA ANALYSIS, 2014, 71 :52-78
[5]  
Calinski T., 1974, COMMUN STAT, V3, P1, DOI https://doi.org/10.1080/03610927408827101
[6]   MULTIVARIATE STUDY OF VARIATION IN 2 SPECIES OF ROCK CRAB OF GENUS LEPTOGRAPSUS [J].
CAMPBELL, NA ;
MAHON, RJ .
AUSTRALIAN JOURNAL OF ZOOLOGY, 1974, 22 (03) :417-425
[7]  
Celebi M.E., 2012, ARXIV12091960
[8]  
Celebi M. Emre, 2015, Partitional clustering algorithms, P79
[9]   A CLASSIFICATION EM ALGORITHM FOR CLUSTERING AND 2 STOCHASTIC VERSIONS [J].
CELEUX, G ;
GOVAERT, G .
COMPUTATIONAL STATISTICS & DATA ANALYSIS, 1992, 14 (03) :315-332
[10]   MAXIMUM LIKELIHOOD FROM INCOMPLETE DATA VIA EM ALGORITHM [J].
DEMPSTER, AP ;
LAIRD, NM ;
RUBIN, DB .
JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-METHODOLOGICAL, 1977, 39 (01) :1-38