Active Evaluation of Classifiers on Large Datasets

Cited by: 15
Authors
Katariya, Namit [1 ]
Iyer, Arun [2 ]
Sarawagi, Sunita [1 ]
Affiliations
[1] Indian Inst Technol, Bombay, Maharashtra, India
[2] Yahoo Res, Bangalore, Karnataka, India
Source
12TH IEEE INTERNATIONAL CONFERENCE ON DATA MINING (ICDM 2012) | 2012
Keywords
Accuracy estimation; active evaluation; learning hash functions
DOI
10.1109/ICDM.2012.161
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The goal of this work is to estimate the accuracy of a classifier on a large unlabeled dataset based on a small labeled set and a human labeler. We seek to estimate accuracy and select instances for labeling in a loop via a continuously refined stratified sampling strategy. For stratifying data we develop a novel strategy of learning r-bit hash functions to preserve similarity in accuracy values. We show that our algorithm provides better accuracy estimates than existing methods for learning distance-preserving hash functions. Experiments on a wide spectrum of real datasets show that our estimates achieve between 15% and 62% relative reduction in error compared to existing approaches. We show how to perform stratified sampling on unlabeled data that is so large that in an interactive setting even a single sequential scan is impractical. We present an optimal algorithm for performing importance sampling on a static index over the data that achieves close to exact estimates while reading three orders of magnitude less data.
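As a rough illustration of the stratified estimation idea described in the abstract, the sketch below uses plain random-hyperplane LSH as a stand-in for the paper's learned accuracy-preserving hash functions. The function names, the labeling oracle, and the fixed per-stratum labeling budget are all hypothetical, and the index-based importance sampling and iterative stratum refinement are omitted.

```python
# A minimal sketch of stratified accuracy estimation, assuming NumPy and using
# random-hyperplane LSH in place of the paper's learned accuracy-preserving
# hash functions. All names below are illustrative, not the authors' code.
import numpy as np


def hash_codes(X, r, rng):
    """Assign each instance an r-bit code via random hyperplanes (plain LSH)."""
    planes = rng.standard_normal((X.shape[1], r))
    bits = (X @ planes) > 0
    # Pack the r bits of each row into one integer that identifies its stratum.
    return bits.astype(np.int64) @ (1 << np.arange(r))


def stratified_accuracy_estimate(X, is_correct, r=4, labels_per_stratum=5, seed=0):
    """Estimate classifier accuracy on X by labeling a few instances per stratum.

    is_correct(i) returns 1.0 if the classifier's prediction on instance i is
    correct; it stands in for querying the human labeler in the loop.
    """
    rng = np.random.default_rng(seed)
    codes = hash_codes(X, r, rng)
    n, estimate = len(X), 0.0
    for code in np.unique(codes):
        idx = np.flatnonzero(codes == code)
        weight = len(idx) / n                                    # stratum proportion W_s
        sample = rng.choice(idx, size=min(labels_per_stratum, len(idx)), replace=False)
        acc_s = np.mean([is_correct(i) for i in sample])         # per-stratum accuracy
        estimate += weight * acc_s                               # sum_s W_s * acc_s
    return estimate


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((10_000, 20))

    # Hypothetical labeling oracle: the classifier is right 90% of the time on
    # one half of the feature space and 60% of the time on the other half.
    def is_correct(i):
        return float(rng.random() < (0.9 if X[i, 0] > 0 else 0.6))

    print("estimated accuracy:", stratified_accuracy_estimate(X, is_correct))
```

The final estimate is the standard stratified-sampling combination of per-stratum sample accuracies weighted by stratum proportions; the paper additionally learns the hashing so that instances within a stratum have similar accuracy, allocates the labeling budget adaptively, and performs importance sampling over a static index so that the full data never needs to be scanned.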
Pages: 329-338
Number of pages: 10