Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data

Cited: 0
Authors
Radovanovic, Milos [1 ]
Nanopoulos, Alexandros [2 ]
Ivanovic, Mirjana [1 ]
Affiliations
[1] Univ Novi Sad, Dept Math & Informat, Novi Sad 21000, Serbia
[2] Univ Hildesheim, Inst Comp Sci, D-31141 Hildesheim, Germany
Keywords
nearest neighbors; curse of dimensionality; classification; semi-supervised learning; clustering; point processes; reduction; algorithms
DOI: not available
CLC number: TP [Automation & Computer Technology]
Discipline code: 0812
Abstract
Different aspects of the curse of dimensionality are known to present serious challenges to various machine-learning methods and tasks. This paper explores a new aspect of the dimensionality curse, referred to as hubness, that affects the distribution of k-occurrences: the number of times a point appears among the k nearest neighbors of other points in a data set. Through theoretical and empirical analysis involving synthetic and real data sets we show that under commonly used assumptions this distribution becomes considerably skewed as dimensionality increases, causing the emergence of hubs, that is, points with very high k-occurrences which effectively represent "popular" nearest neighbors. We examine the origins of this phenomenon, showing that it is an inherent property of data distributions in high-dimensional vector space, discuss its interaction with dimensionality reduction, and explore its influence on a wide range of machine-learning tasks directly or indirectly based on measuring distances, belonging to supervised, semi-supervised, and unsupervised learning families.
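The k-occurrence statistic described in the abstract can be illustrated with a small sketch. This is not the authors' experimental code, only a minimal brute-force illustration on synthetic uniform data: `k_occurrences` counts, for each point, how many times it appears among the k nearest neighbors of the other points, and the driver compares the maximum count at low versus high dimensionality (the skew toward hubs typically grows with dimension).

```python
import random
import math
from collections import Counter

def k_occurrences(points, k):
    """N_k(x): how many times each point appears among the
    k nearest neighbors of the other points (brute force)."""
    n = len(points)
    counts = Counter()
    for i in range(n):
        # Euclidean distances from point i to every other point.
        dists = sorted(
            (math.dist(points[i], points[j]), j)
            for j in range(n) if j != i
        )
        # The k closest points each gain one k-occurrence.
        for _, j in dists[:k]:
            counts[j] += 1
    return [counts[i] for i in range(n)]

if __name__ == "__main__":
    random.seed(0)
    n, k = 200, 5
    for d in (3, 100):  # low- vs high-dimensional uniform data
        pts = [[random.random() for _ in range(d)] for _ in range(n)]
        occ = k_occurrences(pts, k)
        # The total is always n * k; the maximum reveals hubs.
        print(f"d={d}: max N_k = {max(occ)}, total = {sum(occ)}")
```

Note that the sum of all k-occurrences is always n * k regardless of dimensionality; only the shape of the distribution changes, which is why the paper measures its skewness rather than its mean.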
Pages: 2487-2531 (45 pages)