Probabilistic characterization of nearest neighbor classifier

Cited by: 19
Authors
Dhurandhar, Amit [1 ]
Dobra, Alin [2 ]
Affiliations
[1] IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, USA
[2] University of Florida, Gainesville, FL, USA
Funding
US National Science Foundation (NSF)
Keywords
kNN; Moments; Algorithms; Prediction
DOI
10.1007/s13042-012-0091-y
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
The k-nearest neighbor (kNN) classification algorithm is one of the simplest yet most effective classification algorithms in use. It finds major applications in text categorization, outlier detection, handwritten character recognition, fraud detection, and other related areas. Although sound theoretical results exist regarding the convergence of the generalization error (GE) of this algorithm to the Bayes error, these results are asymptotic in nature; the behavior of the kNN algorithm in real-world, finite-sample scenarios is much less well understood. In this paper, assuming categorical attributes, we provide a principled way of studying the non-asymptotic behavior of the kNN algorithm. In particular, we derive exact closed-form expressions for the moments of the GE of this algorithm. The expressions are functions of the sample size and can be computed given any joint probability distribution defined over the input-output space. They can thus serve as a tool for unveiling the statistical behavior of the algorithm in settings of interest, such as finding an acceptable value of k for a given sample size and distribution. Moreover, Monte Carlo approximations of such closed-form expressions were shown in Dhurandhar and Dobra (J Mach Learn Res 9, 2008; ACM Trans Knowl Discov Data 3, 2009) to be faster and more accurate than computing the moments directly by Monte Carlo. This work employs that recently proposed semi-analytical methodology to better understand the non-asymptotic behavior of learning algorithms.
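For intuition, the quantities above can be made concrete. Writing GE(D) for the probability that a kNN classifier trained on a dataset D misclassifies a fresh point drawn from the joint distribution, the paper's closed-form expressions give the moments E[GE(D)] and E[GE(D)^2] over random datasets of a fixed size. The Python sketch below estimates these moments by plain Monte Carlo, i.e., the baseline that the closed-form and semi-analytical approaches are compared against, not the paper's method. The toy distribution, attribute sizes, and helper names are illustrative assumptions, not taken from the paper.

import itertools
import random
from collections import Counter

random.seed(0)

# Toy joint distribution over a categorical input-output space:
# two ternary attributes, binary class label (illustrative assumption).
ATTR_VALUES = [3, 3]
INPUTS = list(itertools.product(*(range(v) for v in ATTR_VALUES)))

def p_joint(x, y):
    # Illustrative joint probability P(x, y); any valid table would do.
    p_x = 1.0 / len(INPUTS)               # uniform over inputs
    p_y_given_x = 0.8 if y == sum(x) % 2 else 0.2
    return p_x * p_y_given_x

def sample_point():
    # Draw one (x, y) pair by inverting the cumulative distribution.
    r, acc = random.random(), 0.0
    for x in INPUTS:
        for y in (0, 1):
            acc += p_joint(x, y)
            if r <= acc:
                return x, y
    return INPUTS[-1], 1

def knn_predict(train, x, k):
    # Majority vote among the k training points nearest in Hamming distance,
    # the natural metric for categorical attributes.
    nearest = sorted(train, key=lambda xy: sum(a != b for a, b in zip(xy[0], x)))[:k]
    return Counter(y for _, y in nearest).most_common(1)[0][0]

def generalization_error(train, k):
    # Exact GE of the trained classifier: the input space is finite, so sum
    # the probability mass of every misclassified (x, y) pair.
    return sum(p_joint(x, y)
               for x in INPUTS for y in (0, 1)
               if knn_predict(train, x, k) != y)

def mc_moments(n, k, num_datasets=200):
    # Monte Carlo estimates of E[GE] and E[GE^2] over datasets of size n.
    ges = [generalization_error([sample_point() for _ in range(n)], k)
           for _ in range(num_datasets)]
    m1 = sum(ges) / num_datasets
    m2 = sum(g * g for g in ges) / num_datasets
    return m1, m2

m1, m2 = mc_moments(n=20, k=3)
print(f"E[GE] ~= {m1:.4f}, Var[GE] ~= {m2 - m1 ** 2:.4f}")

Sweeping k or n in mc_moments shows how the error moments respond to those choices; the paper's contribution is that, for categorical attributes, the outer expectation over datasets can be evaluated in closed form instead of being sampled this way.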
Pages: 259-272
Page count: 14
References
25 in total
[1] Abello J. Handbook of Massive Data Sets. 2002.
[2] Duda R.O., Hart P.E., Stork D.G. Pattern Classification. 2000.
[3] Blum A. Computational Learning Theory. 1999.
[4] Bremner D., Demaine E., Erickson J., Iacono J., Langerman S., Morin P., Toussaint G. Output-sensitive algorithms for computing nearest-neighbour decision boundaries [J]. DISCRETE & COMPUTATIONAL GEOMETRY, 2005, 33(04): 593-604.
[5] Connor-Linton J. Chi Square Tutorial. 2003.
[6] Dhurandhar A., Dobra A. Semi-analytical method for analyzing models and model selection measures based on moment analysis [J]. ACM TRANSACTIONS ON KNOWLEDGE DISCOVERY FROM DATA, 2009, 3(01).
[7] Dhurandhar A., Dobra A. Probabilistic characterization of random decision trees [J]. JOURNAL OF MACHINE LEARNING RESEARCH, 2008, 9: 2321.
[8] Hall M.A. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2003.
[9] Hall P., Park B.U., Samworth R.J. Choice of neighbor order in nearest-neighbor classification [J]. ANNALS OF STATISTICS, 2008, 36(05): 2135-2152.
[10] Hu Q., Pan W., An S., Ma P., Wei J. An efficient gene selection technique for cancer recognition based on neighborhood mutual information [J]. INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS, 2010, 1(1-4): 63-74.