Clusterability and Clustering of Images and Other "Real" High-Dimensional Data

Cited by: 20
Authors
Yellamraju, Tarun [1]
Boutin, Mireille [1]
Affiliations
[1] Purdue Univ, Sch Elect & Comp Engn, W Lafayette, IN 47907 USA
Funding
National Science Foundation (USA);
Keywords
Clustering; high-dimension; random projection; MODELS;
DOI
10.1109/TIP.2017.2789327
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Clustering a high-dimensional data set is known to be very difficult. In this paper, we show that this is not the case when the points to cluster correspond to images. More specifically, image data sets are shown to have so much structure that projecting the set onto a random 1D linear subspace is likely to uncover a binary grouping among the images. Based on this observation, we propose a method to quantify the clusterability of a data set. The method is based on the probability density of a measure (S) of clusterability (in 1D) of the projection of the data onto a random line. After comparing the clusterability of image data sets with that of synthetically generated clusters, we conclude that the intriguing structures we find in image data sets do not fit the notion of clusters in the traditional sense. Our observation also suggests a fast method for clustering high-dimensional data hierarchically: at each stage, the data is partitioned in two based on the binary clustering found in a 1D random projection of the data. Since most of the computations are performed in 1D, this approach is extremely efficient. Despite its simplicity, it achieves better overall clustering quality than existing high-dimensional clustering methods, not only for data sets representing image data, but for other real data sets as well. Our results highlight the need to re-examine our assumptions about high-dimensional clustering and the geometry of real data sets such as sets of images.
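The hierarchical scheme described in the abstract (project onto a random line, split in two in 1D, recurse) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not define the measure S, so the score below (the fraction of 1D variance explained by the best two-way split) is only a stand-in, and keeping the best of several random directions is likewise an assumption. All function names and parameters (`best_1d_split`, `random_projection_split`, `hierarchical_cluster`, `n_trials`, `min_size`) are hypothetical.

```python
import numpy as np

def best_1d_split(x):
    """Exact two-group split of 1D values minimizing within-group squared error.

    Returns (threshold, score): values <= threshold form the left group.
    The score is a stand-in clusterability value in [0, 1], NOT the paper's S.
    """
    xs = np.sort(np.asarray(x, dtype=float))
    n = xs.size
    total = np.sum((xs - xs.mean()) ** 2) if n > 0 else 0.0
    if n < 2 or total == 0.0:
        return None, 0.0
    c1 = np.cumsum(xs)        # prefix sums of values
    c2 = np.cumsum(xs ** 2)   # prefix sums of squared values
    k = np.arange(1, n)       # candidate left-group sizes
    left_ss = c2[:-1] - c1[:-1] ** 2 / k
    right_ss = (c2[-1] - c2[:-1]) - (c1[-1] - c1[:-1]) ** 2 / (n - k)
    within = left_ss + right_ss
    i = int(np.argmin(within))
    return xs[i], 1.0 - within[i] / total

def random_projection_split(X, n_trials=20, rng=None):
    """Try several random lines; keep the binary split with the highest score."""
    rng = np.random.default_rng(rng)
    best_mask, best_score = None, -1.0
    for _ in range(n_trials):
        w = rng.standard_normal(X.shape[1])
        w /= np.linalg.norm(w)          # random unit direction
        p = X @ w                       # 1D projection of every point
        t, s = best_1d_split(p)
        if t is not None and s > best_score:
            best_mask, best_score = p > t, s
    return best_mask, best_score

def hierarchical_cluster(X, depth=2, min_size=10, n_trials=20, rng=None):
    """Split the data in two repeatedly, up to `depth` levels; returns labels."""
    rng = np.random.default_rng(rng)
    clusters = [np.arange(X.shape[0])]
    for _ in range(depth):
        next_clusters = []
        for idx in clusters:
            mask = None
            if idx.size >= min_size:
                mask, _ = random_projection_split(X[idx], n_trials, rng)
            if mask is None:
                next_clusters.append(idx)   # too small or degenerate: keep as is
            else:
                next_clusters.extend([idx[~mask], idx[mask]])
        clusters = next_clusters
    labels = np.empty(X.shape[0], dtype=int)
    for c, idx in enumerate(clusters):
        labels[idx] = c
    return labels
```

Calling `hierarchical_cluster(X, depth=1)` on an `(n_samples, n_features)` array yields a single binary partition; larger `depth` values produce up to `2**depth` groups. Apart from the projection `X @ w`, the work at each node is a sort and a few prefix sums over 1D values, which is the source of the efficiency the abstract mentions.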
Pages: 1927-1938
Page count: 12
Related papers
50 in total
  • [31] Dating the break in high-dimensional data
    Wang, Runmin
    Shao, Xiaofeng
    BERNOULLI, 2023, 29 (04) : 2879 - 2901
  • [32] Accelerating Density-Based Subspace Clustering in High-Dimensional Data
    Prinzbach, Juergen
    Lauer, Tobias
    Kiefer, Nicolas
    21ST IEEE INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOPS ICDMW 2021, 2021 : 474 - 481
  • [33] A GA-based Feature Selection for High-dimensional Data Clustering
    Sun, Mei
    Xiong, Langhuan
    Sun, Haojun
    Jiang, Dazhi
    THIRD INTERNATIONAL CONFERENCE ON GENETIC AND EVOLUTIONARY COMPUTING, 2009 : 769 - 772
  • [34] A PROBABILISTIC l1 METHOD FOR CLUSTERING HIGH-DIMENSIONAL DATA
    Asamov, Tsvetan
    Ben-Israel, Adi
    PROBABILITY IN THE ENGINEERING AND INFORMATIONAL SCIENCES, 2022, 36 (02) : 433 - 448
  • [35] Parameter-wise co-clustering for high-dimensional data
    Gallaugher, M. P. B.
    Biernacki, C.
    McNicholas, P. D.
    COMPUTATIONAL STATISTICS, 2023, 38 (03) : 1597 - 1619
  • [36] Adaptive multi-view subspace clustering for high-dimensional data
    Yan, Fei
    Wang, Xiao-dong
    Zeng, Zhi-qiang
    Hong, Chao-qun
    PATTERN RECOGNITION LETTERS, 2020, 130 : 299 - 305
  • [37] Subspace Clustering in High-Dimensional Data Streams: A Systematic Literature Review
    Ghani, Nur Laila Ab
    Aziz, Izzatdin Abdul
    AbdulKadir, Said Jadid
    CMC-COMPUTERS MATERIALS & CONTINUA, 2023, 75 (02) : 4649 - 4668
  • [38] Local gap density for clustering high-dimensional data with varying densities
    Li, Ruijia
    Yang, Xiaofei
    Qin, Xiaolong
    Zhu, William
    KNOWLEDGE-BASED SYSTEMS, 2019, 184
  • [39] Redefining clustering for high-dimensional applications
    Aggarwal, CC
    Yu, PS
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2002, 14 (02) : 210 - 225
  • [40] Storage-optimizing clustering algorithms for high-dimensional tick data
    Buza, Krisztian
    Nagy, Gabor I.
    Nanopoulos, Alexandros
    EXPERT SYSTEMS WITH APPLICATIONS, 2014, 41 (09) : 4148 - 4157