Fast agglomerative clustering using approximate traveling salesman solutions

被引:1
作者
Sieranoja, Sami [1 ]
Franti, Pasi [1 ]
机构
[1] Univ Eastern Finland, Sch Comp, POB 111,N80101, Joensuu, Finland
基金
芬兰科学院;
关键词
Data clustering; Large-scale data; Travelling salesman problem; VECTOR QUANTIZATION; ALGORITHM;
D O I
10.1186/s40537-024-01053-x
中图分类号
TP301 [理论、方法];
学科分类号
081202 ;
摘要
Agglomerative clustering, and Ward's method, in particular, provide good clustering accuracy for most applications. However, its adoption has been limited by its quadratic time complexity, which makes it slow for large datasets. It also consumes O(N2) memory for non-vectorial data. In this work, we propose a fast variant of Ward's method that reduces the number of distance calculations and only needs O(N) memory. It works by constraining the merges to neighboring nodes on a novel fully connected graph that consists of multiple approximate traveling salesman problem (TSP) solutions. This avoids the problems caused by disconnected graph components that can occur with a k-nearest neighbors graph. The method is general and works for all types of data for which a distance function can be defined. For a dataset of size 1.28 million, the proposed method achieves a speedup factor of 25:1 compared with the best exact implementation of Ward's method and obtains identical results. The algorithm allows a flexible compromise between clustering quality and running time. For a moderate degradation in quality (NMI 0.90 to 0.80), the speedup factor improves further to 625:1. The proposed TSP-graph can also be used in other applications that require a connected neighborhood graph.
引用
收藏
页数:18
相关论文
共 39 条
[1]   Adjacency-constrained hierarchical clustering of a band similarity matrix with application to genomics [J].
Ambroise, Christophe ;
Dehman, Alia ;
Neuvial, Pierre ;
Rigaill, Guillem ;
Vialaneix, Nathalie .
ALGORITHMS FOR MOLECULAR BIOLOGY, 2019, 14 (01)
[2]   New upper bounds for kissing numbers from semidefinite programming [J].
Bachoc, Christine ;
Vallentin, Frank .
JOURNAL OF THE AMERICAN MATHEMATICAL SOCIETY, 2008, 21 (03) :909-924
[3]  
Batagelj V., 1988, Generalized Ward and Related Clustering Problems, P67
[4]   Efficient agglomerative hierarchical clustering [J].
Bouguettaya, Athman ;
Yu, Qi ;
Liu, Xumin ;
Zhou, Xiangmin ;
Song, Andy .
EXPERT SYSTEMS WITH APPLICATIONS, 2015, 42 (05) :2785-2797
[5]  
Chieng H. H., 2014, RECENT ADV SOFT COMP, P89, DOI DOI 10.1007/978-3-319-07692-89
[6]  
Cormen T.H., 2022, Introduction to Algorithms
[7]   Comparing community structure identification -: art. no. P09008 [J].
Danon, L ;
Díaz-Guilera, A ;
Duch, J ;
Arenas, A .
JOURNAL OF STATISTICAL MECHANICS-THEORY AND EXPERIMENT, 2005, :219-228
[8]   A NEW VECTOR QUANTIZATION CLUSTERING-ALGORITHM [J].
EQUITZ, WH .
IEEE TRANSACTIONS ON ACOUSTICS SPEECH AND SIGNAL PROCESSING, 1989, 37 (10) :1568-1575
[9]   Mutual proximity graphs for improved reachability in music recommendation [J].
Flexer, Arthur ;
Stevens, Jeff .
JOURNAL OF NEW MUSIC RESEARCH, 2018, 47 (01) :17-28
[10]   Fast and memory efficient implementation of the exact PNN [J].
Fränti, P ;
Kaukoranta, T ;
Shen, DF ;
Chang, KS .
IEEE TRANSACTIONS ON IMAGE PROCESSING, 2000, 9 (05) :773-777