Prototype Selection for Interpretable Classification

Cited by: 166
Authors:
Bien, Jacob [1]
Tibshirani, Robert [1,2]
Affiliations:
[1] Stanford Univ, Dept Stat, Stanford, CA 94305 USA
[2] Stanford Univ, Dept Hlth Res & Policy, Stanford, CA 94305 USA
Funding:
US National Institutes of Health
Keywords:
Classification; prototypes; nearest neighbors; set cover; integer program; nearest-neighbor; algorithms; reduction; models
DOI:
10.1214/11-AOAS495
Chinese Library Classification:
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics]
Subject classification codes:
020208; 070103; 0714
Abstract
Prototype methods seek a minimal subset of samples that can serve as a distillation or condensed view of a data set. As the size of modern data sets grows, being able to present a domain specialist with a short list of "representative" samples chosen from the data set is of increasing interpretative value. While much recent statistical research has been focused on producing sparse-in-the-variables methods, this paper aims at achieving sparsity in the samples. We discuss a method for selecting prototypes in the classification setting (in which the samples fall into known discrete categories). Our method of focus is derived from three basic properties that we believe a good prototype set should satisfy. This intuition is translated into a set cover optimization problem, which we solve approximately using standard approaches. While prototype selection is usually viewed as purely a means toward building an efficient classifier, in this paper we emphasize the inherent value of having a set of prototypical elements. That said, by using the nearest-neighbor rule on the set of prototypes, we can of course discuss our method as a classifier as well. We demonstrate the interpretative value of producing prototypes on the well-known USPS ZIP code digits data set and show that as a classifier it performs reasonably well. We apply the method to a proteomics data set in which the samples are strings and therefore not naturally embedded in a vector space. Our method is compatible with any dissimilarity measure, making it amenable to situations in which using a non-Euclidean metric is desirable or even necessary.
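The set cover view described in the abstract can be sketched as a greedy procedure: repeatedly pick the candidate point whose neighborhood covers the most not-yet-covered same-class points while covering few points of other classes, then classify new samples by the nearest-neighbor rule restricted to the selected prototypes. This is a minimal illustrative sketch, not the paper's exact formulation; the function names, the fixed ball radius `eps`, and the simple gain-minus-penalty score are assumptions for illustration.

```python
import numpy as np

def greedy_prototypes(X, y, eps, max_protos=10):
    """Greedy set-cover-style prototype selection (illustrative sketch).

    Each candidate j is scored by how many uncovered same-class points
    fall inside its eps-ball, minus the number of other-class points
    inside that ball.  Stops when no candidate improves coverage.
    """
    n = len(X)
    # Pairwise Euclidean distances; any dissimilarity matrix could be used.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    covered = np.zeros(n, dtype=bool)
    protos = []
    for _ in range(max_protos):
        best_j, best_score = None, 0
        for j in range(n):
            in_ball = dist[j] <= eps
            gain = np.sum(in_ball & ~covered & (y == y[j]))
            penalty = np.sum(in_ball & (y != y[j]))
            if gain - penalty > best_score:
                best_j, best_score = j, gain - penalty
        if best_j is None:   # no candidate adds net coverage
            break
        protos.append(best_j)
        covered |= (dist[best_j] <= eps) & (y == y[best_j])
    return protos

def classify_1nn(x, X, y, protos):
    """Nearest-neighbor rule using only the selected prototypes."""
    d = np.linalg.norm(X[protos] - x, axis=1)
    return y[protos][int(np.argmin(d))]
```

Because the procedure touches the data only through the distance matrix, swapping in a non-Euclidean dissimilarity (as needed for the proteomics strings in the abstract) changes nothing else in the sketch.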
Pages: 2403-2424 (22 pages)