Enhancing cluster analysis via topological manifold learning

Cited by: 4
Authors
Herrmann, Moritz [1 ,3 ,4 ]
Kazempour, Daniyal [2 ]
Scheipl, Fabian [1 ,3 ]
Kroeger, Peer [2 ]
Affiliations
[1] Ludwig Maximilians Univ Munchen, Dept Stat, Ludwigstr 33, D-80539 Munich, Bavaria, Germany
[2] Christian Albrechts Univ Kiel, Dept Comp Sci, Christian Albrechts Pl 4, D-24098 Kiel, Schleswig Holst, Germany
[3] Munich Ctr Machine Learning, Munich, Germany
[4] Ludwig Maximilians Univ Munchen, Inst Med Informat Proc Biometry & Epidemiol, Marchioninistr 15, D-81377 Munich, Bavaria, Germany
Keywords
Cluster analysis; Manifold learning; Topological data analysis; Dimensionality reduction; K-means; Homology; PCA
DOI
10.1007/s10618-023-00980-2
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We discuss topological aspects of cluster analysis and show that inferring the topological structure of a dataset before clustering it can considerably enhance cluster detection: clustering embedding vectors that represent the inherent structure of a dataset, instead of the observed feature vectors themselves, is highly beneficial. To demonstrate this, we combine the manifold learning method UMAP for inferring the topological structure with the density-based clustering method DBSCAN. Results on synthetic and real data show that this both simplifies and improves clustering across a diverse set of low- and high-dimensional problems, including clusters of varying density and/or entangled shapes. Our approach simplifies clustering because topological pre-processing consistently reduces the parameter sensitivity of DBSCAN. Clustering the resulting embeddings with DBSCAN can then even outperform complex methods such as SPECTACL and ClusterGAN. Finally, our investigation suggests that the crucial issue in clustering is not the nominal dimension of the data or how many irrelevant features it contains, but rather how separable the clusters are in the ambient observation space in which they are embedded, which is usually the (high-dimensional) Euclidean space defined by the features of the data. The approach succeeds because it performs the cluster analysis after projecting the data into a space that is, in some sense, optimized for separability.
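
The two-step pipeline described in the abstract can be sketched in a few lines of Python. The following is a minimal illustration, assuming the umap-learn and scikit-learn packages; the two-moons toy data and all parameter values are illustrative assumptions, not the authors' experimental settings.

# Minimal sketch of the UMAP -> DBSCAN pipeline from the abstract.
# Assumptions: umap-learn and scikit-learn are installed; the dataset and
# parameter values are illustrative, not the authors' experimental settings.
import umap                                # pip install umap-learn
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Entangled, non-convex clusters in the observed feature space.
X, y_true = make_moons(n_samples=1000, noise=0.05, random_state=0)

# Step 1: infer the topological structure with UMAP, i.e. embed the data
# in a low-dimensional space that favours cluster separability.
reducer = umap.UMAP(n_neighbors=30, min_dist=0.0, n_components=2, random_state=0)
embedding = reducer.fit_transform(X)

# Step 2: run DBSCAN on the embedding vectors instead of the raw features.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(embedding)

print("Adjusted Rand index:", adjusted_rand_score(y_true, labels))

Clustering the embedding rather than X directly is what the abstract reports as both simplifying the analysis and reducing DBSCAN's sensitivity to its parameters.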
Pages: 840-887
Number of pages: 48