Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data

被引:0
作者
Radovanovic, Milos [1 ]
Nanopoulos, Alexandros [2 ]
Ivanovic, Mirjana [1 ]
机构
[1] Univ Novi Sad, Dept Math & Informat, Novi Sad 21000, Serbia
[2] Univ Hildesheim, Inst Comp Sci, D-31141 Hildesheim, Germany
关键词
nearest neighbors; curse of dimensionality; classification; semi-supervised learning; clustering; POINT-PROCESSES; REDUCTION; ALGORITHMS;
D O I
暂无
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
Different aspects of the curse of dimensionality are known to present serious challenges to various machine-learning methods and tasks. This paper explores a new aspect of the dimensionality curse, referred to as hubness, that affects the distribution of k-occurrences: the number of times a point appears among the k nearest neighbors of other points in a data set. Through theoretical and empirical analysis involving synthetic and real data sets we show that under commonly used assumptions this distribution becomes considerably skewed as dimensionality increases, causing the emergence of hubs, that is, points with very high k-occurrences which effectively represent "popular" nearest neighbors. We examine the origins of this phenomenon, showing that it is an inherent property of data distributions in high-dimensional vector space, discuss its interaction with dimensionality reduction, and explore its influence on a wide range of machine-learning tasks directly or indirectly based on measuring distances, belonging to supervised, semi-supervised, and unsupervised learning families.
引用
收藏
页码:2487 / 2531
页数:45
相关论文
共 50 条
  • [31] Measuring the quality of projections of high-dimensional labeled data
    Benato, Barbara C.
    Falcao, Alexandre X.
    Telea, Alexandru C.
    COMPUTERS & GRAPHICS-UK, 2023, 116 : 287 - 297
  • [32] Centralized and Distributed Anonymization for High-Dimensional Healthcare Data
    Mohammed, Noman
    Fung, Benjamin C. M.
    Hung, Patrick C. K.
    Lee, Cheuk-Kwong
    ACM TRANSACTIONS ON KNOWLEDGE DISCOVERY FROM DATA, 2010, 4 (04)
  • [33] k Nearest Neighbor Similarity Join Algorithm on High-Dimensional Data Using Novel Partitioning Strategy
    Ma, Youzhong
    Hua, Qiaozhi
    Wen, Zheng
    Zhang, Ruiling
    Zhang, Yongxin
    Li, Haipeng
    SECURITY AND COMMUNICATION NETWORKS, 2022, 2022
  • [34] Probabilistic classifiers with high-dimensional data
    Kim, Kyung In
    Simon, Richard
    BIOSTATISTICS, 2011, 12 (03) : 399 - 412
  • [35] ASYMPTOTIC INFERENCE FOR HIGH-DIMENSIONAL DATA
    Kuelbs, Jim
    Vidyashankar, Anand N.
    ANNALS OF STATISTICS, 2010, 38 (02) : 836 - 869
  • [36] Action Recognition in a High-Dimensional Feature Space
    Adiguzel, Hande
    Erdem, Hayrettin
    Ferhatosmanoglu, Hakan
    Duygulu, Pinar
    2013 21ST SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS CONFERENCE (SIU), 2013,
  • [37] Bump hunting in high-dimensional data
    Friedman J.H.
    Fisher N.I.
    Statistics and Computing, 1999, 9 (2) : 123 - 143
  • [38] Recession forecasting with high-dimensional data
    Nevasalmi, Lauri
    JOURNAL OF FORECASTING, 2022, 41 (04) : 752 - 764
  • [39] A classification algorithm for high-dimensional data
    Roy, Asim
    INNS CONFERENCE ON BIG DATA 2015 PROGRAM, 2015, 53 : 345 - 355
  • [40] Hubness-based fuzzy measures for high-dimensional k-nearest neighbor classification
    Nenad Tomašev
    Miloš Radovanović
    Dunja Mladenić
    Mirjana Ivanović
    International Journal of Machine Learning and Cybernetics, 2014, 5 : 445 - 458