Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data

被引:0
作者
Radovanovic, Milos [1 ]
Nanopoulos, Alexandros [2 ]
Ivanovic, Mirjana [1 ]
机构
[1] Univ Novi Sad, Dept Math & Informat, Novi Sad 21000, Serbia
[2] Univ Hildesheim, Inst Comp Sci, D-31141 Hildesheim, Germany
关键词
nearest neighbors; curse of dimensionality; classification; semi-supervised learning; clustering; POINT-PROCESSES; REDUCTION; ALGORITHMS;
D O I
暂无
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
Different aspects of the curse of dimensionality are known to present serious challenges to various machine-learning methods and tasks. This paper explores a new aspect of the dimensionality curse, referred to as hubness, that affects the distribution of k-occurrences: the number of times a point appears among the k nearest neighbors of other points in a data set. Through theoretical and empirical analysis involving synthetic and real data sets we show that under commonly used assumptions this distribution becomes considerably skewed as dimensionality increases, causing the emergence of hubs, that is, points with very high k-occurrences which effectively represent "popular" nearest neighbors. We examine the origins of this phenomenon, showing that it is an inherent property of data distributions in high-dimensional vector space, discuss its interaction with dimensionality reduction, and explore its influence on a wide range of machine-learning tasks directly or indirectly based on measuring distances, belonging to supervised, semi-supervised, and unsupervised learning families.
引用
收藏
页码:2487 / 2531
页数:45
相关论文
共 50 条
  • [21] Analysis of high-dimensional data using local input space histograms
    Kerdels, Jochen
    Peters, Gabriele
    NEUROCOMPUTING, 2015, 169 : 272 - 280
  • [22] Statistical challenges of high-dimensional data INTRODUCTION
    Johnstone, Iain M.
    Titterington, D. Michael
    PHILOSOPHICAL TRANSACTIONS OF THE ROYAL SOCIETY A-MATHEMATICAL PHYSICAL AND ENGINEERING SCIENCES, 2009, 367 (1906): : 4237 - 4253
  • [23] A new approach to discover interlacing data structures in high-dimensional space
    Ban, Tao
    Zhang, Changshui
    Abe, Shigeo
    JOURNAL OF INTELLIGENT INFORMATION SYSTEMS, 2009, 33 (01) : 3 - 22
  • [24] Inference in High-Dimensional Parameter Space
    O'Hare, Anthony
    JOURNAL OF COMPUTATIONAL BIOLOGY, 2015, 22 (11) : 997 - 1004
  • [25] The nearest polyhedral convex conic regions for high-dimensional classification
    Cevikalp, Hakan
    Cimen, Emre
    Ozturk, Gurkan
    TURKISH JOURNAL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCES, 2021, 29 (02) : 913 - 928
  • [26] Randomized Embeddings with Slack and High-Dimensional Approximate Nearest Neighbor
    Anagnostopoulos, Evangelos
    Emiris, Ioannis Z.
    Psarros, Ioannis
    ACM TRANSACTIONS ON ALGORITHMS, 2018, 14 (02)
  • [27] New instability results for high-dimensional nearest neighbor search
    Giannella, Chris
    INFORMATION PROCESSING LETTERS, 2009, 109 (19) : 1109 - 1113
  • [28] Hubness-aware shared neighbor distances for high-dimensional -nearest neighbor classification
    Tomasev, Nenad
    Mladenic, Dunja
    KNOWLEDGE AND INFORMATION SYSTEMS, 2014, 39 (01) : 89 - 122
  • [29] MULTICATEGORY VERTEX DISCRIMINANT ANALYSIS FOR HIGH-DIMENSIONAL DATA
    Wu, Tong Tong
    Lange, Kenneth
    ANNALS OF APPLIED STATISTICS, 2010, 4 (04) : 1698 - 1721
  • [30] Measuring the quality of projections of high-dimensional labeled data
    Benato, Barbara C.
    Falcao, Alexandre X.
    Telea, Alexandru C.
    COMPUTERS & GRAPHICS-UK, 2023, 116 : 287 - 297