Prototype Selection for Interpretable Classification

Cited by: 166
Authors:
Bien, Jacob [1]
Tibshirani, Robert [1,2]
Affiliations:
[1] Stanford Univ, Dept Stat, Stanford, CA 94305 USA
[2] Stanford Univ, Dept Hlth Res & Policy, Stanford, CA 94305 USA
Funding:
US National Institutes of Health
Keywords:
Classification; prototypes; nearest neighbors; set cover; integer program; nearest-neighbor; algorithms; reduction; models
DOI:
10.1214/11-AOAS495
Chinese Library Classification:
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics]
Subject classification codes:
020208; 070103; 0714
Abstract
Prototype methods seek a minimal subset of samples that can serve as a distillation or condensed view of a data set. As the size of modern data sets grows, being able to present a domain specialist with a short list of "representative" samples chosen from the data set is of increasing interpretative value. While much recent statistical research has been focused on producing sparse-in-the-variables methods, this paper aims at achieving sparsity in the samples. We discuss a method for selecting prototypes in the classification setting (in which the samples fall into known discrete categories). Our method of focus is derived from three basic properties that we believe a good prototype set should satisfy. This intuition is translated into a set cover optimization problem, which we solve approximately using standard approaches. While prototype selection is usually viewed as purely a means toward building an efficient classifier, in this paper we emphasize the inherent value of having a set of prototypical elements. That said, by using the nearest-neighbor rule on the set of prototypes, we can of course discuss our method as a classifier as well. We demonstrate the interpretative value of producing prototypes on the well-known USPS ZIP code digits data set and show that as a classifier it performs reasonably well. We apply the method to a proteomics data set in which the samples are strings and therefore not naturally embedded in a vector space. Our method is compatible with any dissimilarity measure, making it amenable to situations in which using a non-Euclidean metric is desirable or even necessary.
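The set cover view described in the abstract can be sketched as a greedy procedure: repeatedly pick the candidate point whose neighborhood covers the most not-yet-covered same-class points while covering few points of other classes, then classify new samples by the nearest-neighbor rule restricted to the selected prototypes. This is a minimal illustrative sketch, not the paper's exact formulation; the function names, the fixed ball radius `eps`, and the simple gain-minus-penalty score are assumptions for illustration.

```python
import numpy as np

def greedy_prototypes(X, y, eps, max_protos=10):
    """Greedy set-cover-style prototype selection (illustrative sketch).

    Each candidate j is scored by how many uncovered same-class points
    fall inside its eps-ball, minus the number of other-class points
    inside that ball.  Stops when no candidate improves coverage.
    """
    n = len(X)
    # Pairwise Euclidean distances; any dissimilarity matrix could be used.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    covered = np.zeros(n, dtype=bool)
    protos = []
    for _ in range(max_protos):
        best_j, best_score = None, 0
        for j in range(n):
            in_ball = dist[j] <= eps
            gain = np.sum(in_ball & ~covered & (y == y[j]))
            penalty = np.sum(in_ball & (y != y[j]))
            if gain - penalty > best_score:
                best_j, best_score = j, gain - penalty
        if best_j is None:   # no candidate adds net coverage
            break
        protos.append(best_j)
        covered |= (dist[best_j] <= eps) & (y == y[best_j])
    return protos

def classify_1nn(x, X, y, protos):
    """Nearest-neighbor rule using only the selected prototypes."""
    d = np.linalg.norm(X[protos] - x, axis=1)
    return y[protos][int(np.argmin(d))]
```

Because the procedure touches the data only through the distance matrix, swapping in a non-Euclidean dissimilarity (as needed for the proteomics strings in the abstract) changes nothing else in the sketch.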
Pages: 2403-2424 (22 pages)