Median-Based Classifiers for High-Dimensional Data

Cited by: 25
Authors
Hall, Peter [1 ]
Titterington, D. M. [2 ]
Xue, Jing-Hao [3 ]
Affiliations
[1] Univ Melbourne, Dept Math & Stat, Melbourne, Vic 3010, Australia
[2] Univ Glasgow, Dept Stat, Glasgow G12 8QQ, Lanark, Scotland
[3] UCL, Dept Stat Sci, London WC1E 6BT, England
Funding
Engineering and Physical Sciences Research Council (EPSRC), UK
Keywords
Centroid classifier; Componentwise median; Data depth; Distance-based classifier; High-dimensional data; L-1 method; Robust method; Sample median; Spatial median; Strength of dependence; MICROARRAY DATA; DATA DEPTH; SHRUNKEN CENTROIDS; CLASS PREDICTION; CLASSIFICATION; CANCER; INFERENCE; VARIANCE
DOI
10.1198/jasa.2009.tm08107
Chinese Library Classification
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics]
Subject Classification Codes
020208; 070103; 0714
Abstract
Conventional distance-based classifiers use standard Euclidean distance, and so can suffer from excessive volatility if vector components have heavy-tailed distributions. This difficulty can be alleviated by replacing the L-2 distance by its L-1 counterpart. For example, the L-1 version of the popular centroid classifier would allocate a new data value to the population to whose centroid it was closest in L-1 terms. However, this approach can lead to inconsistency, because the centroid is defined using L-2, rather than L-1, distance. In particular, by mixing L-1 and L-2 approaches, we produce a classifier that can seriously misidentify data in cases where the means and medians of marginal distributions take different values. These difficulties motivate replacing centroids by medians. However, in the very-high-dimensional settings commonly encountered today, this can be problematic if we attempt to work with a conventional spatial median. Therefore, we suggest using componentwise medians to construct a robust classifier that is relatively insensitive to the difficulties caused by heavy-tailed data and entails straightforward computation. We also consider generalizations and extensions of this approach based on, for example, using data truncation to achieve additional robustness. Using both empirical and theoretical arguments, we explore the properties of these methods, and show that the resulting classifiers can be particularly effective. Supplementary materials are available online.
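The componentwise-median rule described in the abstract can be sketched as follows: each class is summarized by the vector of componentwise sample medians of its training data, and a new observation is assigned to the class whose median vector is closest in L-1 distance. This is a minimal illustration assuming NumPy; function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def fit_medians(samples_by_class):
    """Summarize each class by its componentwise sample median.

    samples_by_class: dict mapping class label -> (n_k, p) array of
    training observations for that class.
    """
    return {k: np.median(X, axis=0) for k, X in samples_by_class.items()}

def classify(x, medians):
    """Assign x to the class whose median vector is nearest in L-1 distance."""
    return min(medians, key=lambda k: np.abs(x - medians[k]).sum())

# Toy example with heavy-tailed (Cauchy) marginals, where a mean-based
# centroid would be highly volatile but the median remains stable.
rng = np.random.default_rng(0)
p = 50
X0 = rng.standard_cauchy((30, p))          # class 0: marginal medians near 0
X1 = rng.standard_cauchy((30, p)) + 1.0    # class 1: marginal medians near 1
medians = fit_medians({0: X0, 1: X1})
print(classify(np.zeros(p), medians))  # -> 0
print(classify(np.ones(p), medians))   # -> 1
```

Because the Cauchy distribution has no finite mean, the sample mean of each coordinate does not settle down as the sample grows, whereas the componentwise median does; this is the robustness that motivates the paper's classifier.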
Pages: 1597-1608 (12 pages)