Clusterability and Clustering of Images and Other "Real" High-Dimensional Data

Cited by: 20
Authors
Yellamraju, Tarun [1 ]
Boutin, Mireille [1 ]
Affiliation
[1] Purdue Univ, Sch Elect & Comp Engn, W Lafayette, IN 47907 USA
Funding
National Science Foundation (USA)
Keywords
Clustering; high-dimension; random projection; MODELS;
DOI
10.1109/TIP.2017.2789327
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Clustering a high-dimensional data set is known to be very difficult. In this paper, we show that this is not the case when the points to cluster correspond to images. More specifically, image data sets are shown to have so much structure that projecting the set onto a random 1D linear subspace is likely to uncover a binary grouping among the images. Based on this observation, we propose a method to quantify the clusterability of a data set. The method is based on the probability density of a measure (S) of clusterability (in 1D) of the projection of the data onto a random line. After comparing the clusterability of image data sets with that of synthetically generated clusters, we conclude that the intriguing structures we find in image data sets do not fit the notion of clusters in the traditional sense. Our observation also suggests a fast method for clustering high-dimensional data in a hierarchical fashion: at each stage, the data is partitioned into two groups based on the binary clustering found in a 1D random projection of the data. Since most of the computations are performed in 1D, this approach is extremely efficient. Despite its simplicity, it achieves better overall clustering quality than existing high-dimensional clustering methods, not only for data sets representing image data but for other real data sets as well. Our results highlight the need to re-examine our assumptions about high-dimensional clustering and the geometry of real data sets such as sets of images.
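The single partitioning step described in the abstract (project onto a random line, then split the 1D projection into two groups) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. The abstract does not define the measure S, so the score used here (within-group sum of squares of the best two-way 1D split, divided by the total sum of squares) is an assumed stand-in. The function names `best_split_1d` and `random_projection_split` and the parameter `n_trials` are hypothetical.

```python
import numpy as np

def best_split_1d(values):
    """Best two-group split of 1D values by a threshold.

    Returns (score, threshold). The score is an assumed stand-in for a
    1D clusterability measure: within-group sum of squares divided by
    the total sum of squares (lower means a cleaner binary grouping).
    """
    x = np.sort(np.asarray(values, dtype=float))
    n = x.size
    if n < 2:
        return 1.0, (float(x[0]) if n else 0.0)
    total = np.sum((x - x.mean()) ** 2)
    if total == 0.0:
        return 1.0, float(x[0])
    csum = np.cumsum(x)
    csum2 = np.cumsum(x ** 2)
    k = np.arange(1, n)  # size of the left group for each candidate split
    left_ss = csum2[k - 1] - csum[k - 1] ** 2 / k
    right_sum = csum[-1] - csum[k - 1]
    right_ss = (csum2[-1] - csum2[k - 1]) - right_sum ** 2 / (n - k)
    within = left_ss + right_ss
    i = int(np.argmin(within))
    threshold = 0.5 * (x[i] + x[i + 1])  # midpoint between the two groups
    return float(within[i] / total), float(threshold)

def random_projection_split(data, n_trials=50, seed=None):
    """Try several random lines and keep the one with the lowest score.

    Returns (labels, score), where labels is a 0/1 array assigning each
    row of `data` to one side of the chosen threshold.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    best = None
    for _ in range(n_trials):
        direction = rng.standard_normal(data.shape[1])
        direction /= np.linalg.norm(direction)  # random unit vector
        score, threshold = best_split_1d(data @ direction)
        if best is None or score < best[0]:
            best = (score, direction, threshold)
    score, direction, threshold = best
    labels = (data @ direction > threshold).astype(int)
    return labels, score
```

Applying `random_projection_split` again to each of the two resulting groups would give a hierarchy of binary partitions. Because each trial only sorts and scans a 1D array, the cost per trial is dominated by one matrix-vector product and one sort.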
Pages: 1927-1938
Page count: 12