Clusterability and Clustering of Images and Other "Real" High-Dimensional Data

Cited by: 20
Authors
Yellamraju, Tarun [1]
Boutin, Mireille [1]
Affiliations
[1] Purdue Univ, Sch Elect & Comp Engn, W Lafayette, IN 47907 USA
Funding
National Science Foundation (USA);
Keywords
Clustering; high-dimension; random projection; MODELS;
DOI
10.1109/TIP.2017.2789327
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Clustering a high-dimensional dataset is known to be very difficult. In this paper, we show that this is not the case when the points to cluster correspond to images. More specifically, image datasets are shown to have so much structure that projecting the set onto a random 1D linear subspace is likely to uncover a binary grouping among the images. Based on this observation, we propose a method to quantify the clusterability of a dataset. The method is based on the probability density of a measure (S) of clusterability (in 1D) of the projection of the data onto a random line. After comparing the clusterability of image datasets with that of synthetically generated clusters, we conclude that the intriguing structures we find in image datasets do not fit the notion of clusters in the traditional sense. Our observation also suggests a fast method for clustering high-dimensional data in a hierarchical fashion: at each stage, the data is partitioned into two groups based on the binary clustering found in a 1D random projection of the data. Since most of the computations are performed in 1D, this approach is extremely efficient. Despite its simplicity, it achieves better overall clustering quality than existing high-dimensional clustering methods, not only for datasets representing image data but for other real datasets as well. Our results highlight the need to re-examine our assumptions about high-dimensional clustering and the geometry of real datasets such as sets of images.
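The procedure outlined in the abstract (project onto a random line, find a binary grouping in 1D, recurse) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the abstract does not define the measure S, so the code uses a hypothetical stand-in, namely the within-group sum of squares of the best 1D two-group split divided by the total sum of squares (lower means more clusterable). The function names, the number of random lines tried, and the stopping parameters (`max_depth`, `min_size`, `max_score`) are all assumptions made for illustration.

```python
import numpy as np


def best_1d_split(x):
    """Best two-group split of 1D values x.

    Returns (score, threshold), where score = within-group SS / total SS.
    This score is an ASSUMED stand-in for the paper's measure S.
    """
    xs = np.sort(np.asarray(x, dtype=float))
    n = xs.size
    if n < 2:
        return 1.0, None
    total = np.sum((xs - xs.mean()) ** 2)
    if total == 0:
        return 1.0, None
    c1 = np.cumsum(xs)        # running sums
    c2 = np.cumsum(xs ** 2)   # running sums of squares
    k = np.arange(1, n)       # candidate sizes of the left group
    left_ss = c2[k - 1] - c1[k - 1] ** 2 / k
    right_ss = (c2[-1] - c2[k - 1]) - (c1[-1] - c1[k - 1]) ** 2 / (n - k)
    within = left_ss + right_ss
    i = int(np.argmin(within))
    return within[i] / total, 0.5 * (xs[i] + xs[i + 1])


def sample_scores(X, n_lines=100, seed=0):
    """Scores of the projections of X onto n_lines random lines.

    A histogram of the result approximates the density of the score.
    """
    rng = np.random.default_rng(seed)
    scores = np.empty(n_lines)
    for j in range(n_lines):
        w = rng.standard_normal(X.shape[1])
        w /= np.linalg.norm(w)
        scores[j], _ = best_1d_split(X @ w)
    return scores


def split_once(X, n_trials=20, rng=None):
    """Try n_trials random lines; keep the split with the lowest score."""
    rng = np.random.default_rng(rng)
    best_score, best_mask = np.inf, None
    for _ in range(n_trials):
        w = rng.standard_normal(X.shape[1])
        w /= np.linalg.norm(w)
        p = X @ w
        score, thr = best_1d_split(p)
        if thr is not None and score < best_score:
            best_score, best_mask = score, p <= thr
    return best_score, best_mask


def cluster(X, max_depth=3, min_size=5, max_score=0.5, n_trials=20, seed=0):
    """Hierarchical binary clustering via 1D random projections (sketch)."""
    rng = np.random.default_rng(seed)
    labels = np.zeros(len(X), dtype=int)
    next_label = 1
    stack = [(np.arange(len(X)), 0)]
    while stack:
        idx, depth = stack.pop()
        if depth >= max_depth or idx.size < 2 * min_size:
            continue
        score, mask = split_once(X[idx], n_trials, rng)
        if mask is None or score > max_score:
            continue  # no convincing binary grouping found; keep as a leaf
        if mask.sum() < min_size or (~mask).sum() < min_size:
            continue
        right = idx[~mask]
        labels[right] = next_label  # left group keeps its current label
        next_label += 1
        stack.append((idx[mask], depth + 1))
        stack.append((right, depth + 1))
    return labels
```

Calling `cluster(X)` on an array of shape `(n_points, n_dims)` returns one integer label per point. Because each candidate split only requires a dot product and a sort of the projected values, most of the work happens in 1D, which is the source of the efficiency the abstract describes.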
Pages: 1927-1938
Number of pages: 12