Clusterability and Clustering of Images and Other "Real" High-Dimensional Data

Cited by: 20

Authors:

Yellamraju, Tarun [1]

Boutin, Mireille [1]

Affiliations:

[1] Purdue Univ, Sch Elect & Comp Engn, W Lafayette, IN 47907 USA

Source:

IEEE TRANSACTIONS ON IMAGE PROCESSING | 2018 / Vol. 27 / No. 04

Funding:

National Science Foundation (USA);

Keywords:

Clustering; high-dimension; random projection; MODELS;

DOI:

10.1109/TIP.2017.2789327

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Clustering a high-dimensional data set is known to be very difficult. In this paper, we show that this is not the case when the points to cluster correspond to images. More specifically, image data sets are shown to have a lot of structure, so much so that projecting the set onto a random 1D linear subspace is likely to uncover a binary grouping among the images. Based on this observation, we propose a method to quantify the clusterability of a data set. The method is based on the probability density of a measure (S) of clusterability (in 1D) of the projection of the data onto a random line. After comparing the clusterability of image data sets with that of synthetically generated clusters, we conclude that the intriguing structures we find in image data sets do not fit the notion of clusters in the traditional sense. Our observation further suggests a fast method for clustering high-dimensional data in a hierarchical fashion: at each stage, the data is partitioned into two based on the binary clustering found in a 1D random projection of the data. Since most of the computations are performed in 1D, this approach is extremely efficient. Despite its simplicity, it achieves an overall better quality of clustering than existing high-dimensional clustering methods, not only for data sets representing image data, but for other real data sets as well. Our results highlight the need to re-examine our assumptions about high-dimensional clustering and the geometry of real data sets such as sets of images.
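
The hierarchical scheme described in the abstract (project onto a random line, look for a binary grouping in 1D, split, recurse) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not define the measure S, so the score below (within-group sum of squares of the best two-way 1D split, divided by the total sum of squares) is a stand-in, and the function names `best_1d_split` and `random_projection_tree` as well as the parameters `n_trials`, `min_size`, `max_score`, and `max_depth` are hypothetical.

```python
import numpy as np


def best_1d_split(values):
    """Find the best two-way split of 1D values.

    Returns (score, threshold). The score is the within-group sum of
    squares of the best split divided by the total sum of squares, so
    lower means a clearer binary grouping. ASSUMPTION: this score is a
    stand-in for illustration, not the paper's measure S.
    """
    x = np.sort(np.asarray(values, dtype=float))
    n = x.size
    total = np.sum((x - x.mean()) ** 2) if n > 0 else 0.0
    if n < 2 or total == 0.0:
        return 1.0, None
    csum = np.cumsum(x)
    csq = np.cumsum(x ** 2)
    k = np.arange(1, n)  # number of points in the left group
    left = csq[k - 1] - csum[k - 1] ** 2 / k
    right = (csq[-1] - csq[k - 1]) - (csum[-1] - csum[k - 1]) ** 2 / (n - k)
    within = np.maximum(left + right, 0.0)  # guard against round-off
    i = int(np.argmin(within))
    threshold = 0.5 * (x[i] + x[i + 1])
    return float(within[i] / total), float(threshold)


def random_projection_tree(X, rng, n_trials=20, min_size=10,
                           max_score=0.2, max_depth=5, depth=0):
    """Recursively bisect the rows of X using random 1D projections.

    At each node, n_trials random unit directions are tried and the
    projection with the lowest split score is kept. The node is split
    only if that score is at most max_score (hypothetical stopping rule).
    Returns an integer cluster label per row.
    """
    X = np.asarray(X, dtype=float)
    labels = np.zeros(len(X), dtype=int)
    if len(X) < min_size or depth >= max_depth:
        return labels

    best_score, best_mask = np.inf, None
    for _ in range(n_trials):
        w = rng.standard_normal(X.shape[1])
        w /= np.linalg.norm(w)  # random unit direction
        proj = X @ w            # all remaining work is in 1D
        score, thr = best_1d_split(proj)
        if thr is not None and score < best_score:
            best_score, best_mask = score, proj <= thr

    if (best_mask is None or best_score > max_score
            or best_mask.all() or not best_mask.any()):
        return labels  # no convincing binary grouping found

    left = random_projection_tree(X[best_mask], rng, n_trials, min_size,
                                  max_score, max_depth, depth + 1)
    right = random_projection_tree(X[~best_mask], rng, n_trials, min_size,
                                   max_score, max_depth, depth + 1)
    labels[best_mask] = left
    labels[~best_mask] = right + left.max() + 1
    return labels
```

A call such as `random_projection_tree(X, np.random.default_rng(0))` returns one label per row of `X`. The cutoff `max_score=0.2` is an arbitrary choice for this sketch: a single Gaussian blob scores about 0.36 under this stand-in, so a lower cutoff avoids splitting unimodal data.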


Pages: 1927 - 1938

Number of pages: 12

