Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data

被引:0
|
作者
Radovanovic, Milos [1 ]
Nanopoulos, Alexandros [2 ]
Ivanovic, Mirjana [1 ]
机构
[1] Univ Novi Sad, Dept Math & Informat, Novi Sad 21000, Serbia
[2] Univ Hildesheim, Inst Comp Sci, D-31141 Hildesheim, Germany
关键词
nearest neighbors; curse of dimensionality; classification; semi-supervised learning; clustering; POINT-PROCESSES; REDUCTION; ALGORITHMS;
D O I
暂无
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
Different aspects of the curse of dimensionality are known to present serious challenges to various machine-learning methods and tasks. This paper explores a new aspect of the dimensionality curse, referred to as hubness, that affects the distribution of k-occurrences: the number of times a point appears among the k nearest neighbors of other points in a data set. Through theoretical and empirical analysis involving synthetic and real data sets we show that under commonly used assumptions this distribution becomes considerably skewed as dimensionality increases, causing the emergence of hubs, that is, points with very high k-occurrences which effectively represent "popular" nearest neighbors. We examine the origins of this phenomenon, showing that it is an inherent property of data distributions in high-dimensional vector space, discuss its interaction with dimensionality reduction, and explore its influence on a wide range of machine-learning tasks directly or indirectly based on measuring distances, belonging to supervised, semi-supervised, and unsupervised learning families.
引用
收藏
页码:2487 / 2531
页数:45
相关论文
共 50 条
  • [1] REPORTING NEIGHBORS IN HIGH-DIMENSIONAL EUCLIDEAN SPACE
    Aiger, Dror
    Kaplan, Haim
    Sharir, Micha
    SIAM JOURNAL ON COMPUTING, 2014, 43 (04) : 1363 - 1395
  • [2] On the Behavior of Intrinsically High-Dimensional Spaces: Distances, Direct and Reverse Nearest Neighbors, and Hubness
    Angiulli, Fabrizio
    JOURNAL OF MACHINE LEARNING RESEARCH, 2018, 18
  • [3] Hybrid (CPU/GPU) Exact Nearest Neighbors Search in High-Dimensional Spaces
    Muhr, David
    Affenzeller, Michael
    ARTIFICIAL INTELLIGENCE APPLICATIONS AND INNOVATIONS, AIAI 2022, PART II, 2022, 647 : 112 - 123
  • [4] Fuzzy nearest neighbor clustering of high-dimensional data
    Wang, HB
    Yu, YQ
    Zhou, DR
    Meng, B
    2003 INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND CYBERNETICS, VOLS 1-5, PROCEEDINGS, 2003, : 2569 - 2572
  • [5] An efficient nearest neighbor search in high-dimensional data spaces
    Lee, DH
    Kim, HJ
    INFORMATION PROCESSING LETTERS, 2002, 81 (05) : 239 - 246
  • [6] On the Design and Applicability of Distance Functions in High-Dimensional Data Space
    Hsu, Chih-Ming
    Chen, Ming-Syan
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2009, 21 (04) : 523 - 536
  • [7] A Normality Test for High-dimensional Data Based on the Nearest Neighbor Approach
    Chen, Hao
    Xia, Yin
    JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 2023, 118 (541) : 719 - 731
  • [8] Efficient and Accurate Nearest Neighbor and Closest Pair Search in High-Dimensional Space
    Tao, Yufei
    Yi, Ke
    Sheng, Cheng
    Kalnis, Panos
    ACM TRANSACTIONS ON DATABASE SYSTEMS, 2010, 35 (03):
  • [9] The Role of Hubness in Clustering High-Dimensional Data
    Tomasev, Nenad
    Radovanovic, Milos
    Mladenic, Dunja
    Ivanovic, Mirjana
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2014, 26 (03) : 739 - 751
  • [10] Fast geometrical extraction of nearest neighbors from multi-dimensional data
    Aziz, Yasir
    Memon, Kashif Hussain
    PATTERN RECOGNITION, 2023, 136