Active Evaluation of Classifiers on Large Datasets

Cited by: 15
Authors
Katariya, Namit [1 ]
Iyer, Arun [2 ]
Sarawagi, Sunita [1 ]
Affiliations
[1] Indian Inst Technol, Bombay, Maharashtra, India
[2] Yahoo Res, Bangalore, Karnataka, India
Source
12TH IEEE INTERNATIONAL CONFERENCE ON DATA MINING (ICDM 2012) | 2012
Keywords
Accuracy estimation; active evaluation; learning hash functions
DOI
10.1109/ICDM.2012.161
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The goal of this work is to estimate the accuracy of a classifier on a large unlabeled dataset based on a small labeled set and a human labeler. We seek to estimate accuracy and select instances for labeling in a loop via a continuously refined stratified sampling strategy. For stratifying data we develop a novel strategy of learning r-bit hash functions to preserve similarity in accuracy values. We show that our algorithm provides better accuracy estimates than existing methods for learning distance-preserving hash functions. Experiments on a wide spectrum of real datasets show that our estimates achieve between 15% and 62% relative reduction in error compared to existing approaches. We show how to perform stratified sampling on unlabeled data that is so large that in an interactive setting even a single sequential scan is impractical. We present an optimal algorithm for performing importance sampling on a static index over the data that achieves close to exact estimates while reading three orders of magnitude less data.
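As a rough illustration of the stratified estimation idea described in the abstract, the sketch below uses plain random-hyperplane LSH as a stand-in for the paper's learned accuracy-preserving hash functions. The function names, the labeling oracle, and the fixed per-stratum labeling budget are all hypothetical, and the index-based importance sampling and iterative stratum refinement are omitted.

```python
# A minimal sketch of stratified accuracy estimation, assuming NumPy and using
# random-hyperplane LSH in place of the paper's learned accuracy-preserving
# hash functions. All names below are illustrative, not the authors' code.
import numpy as np


def hash_codes(X, r, rng):
    """Assign each instance an r-bit code via random hyperplanes (plain LSH)."""
    planes = rng.standard_normal((X.shape[1], r))
    bits = (X @ planes) > 0
    # Pack the r bits of each row into one integer that identifies its stratum.
    return bits.astype(np.int64) @ (1 << np.arange(r))


def stratified_accuracy_estimate(X, is_correct, r=4, labels_per_stratum=5, seed=0):
    """Estimate classifier accuracy on X by labeling a few instances per stratum.

    is_correct(i) returns 1.0 if the classifier's prediction on instance i is
    correct; it stands in for querying the human labeler in the loop.
    """
    rng = np.random.default_rng(seed)
    codes = hash_codes(X, r, rng)
    n, estimate = len(X), 0.0
    for code in np.unique(codes):
        idx = np.flatnonzero(codes == code)
        weight = len(idx) / n                                    # stratum proportion W_s
        sample = rng.choice(idx, size=min(labels_per_stratum, len(idx)), replace=False)
        acc_s = np.mean([is_correct(i) for i in sample])         # per-stratum accuracy
        estimate += weight * acc_s                               # sum_s W_s * acc_s
    return estimate


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((10_000, 20))

    # Hypothetical labeling oracle: the classifier is right 90% of the time on
    # one half of the feature space and 60% of the time on the other half.
    def is_correct(i):
        return float(rng.random() < (0.9 if X[i, 0] > 0 else 0.6))

    print("estimated accuracy:", stratified_accuracy_estimate(X, is_correct))
```

The final estimate is the standard stratified-sampling combination of per-stratum sample accuracies weighted by stratum proportions; the paper additionally learns the hashing so that instances within a stratum have similar accuracy, allocates the labeling budget adaptively, and performs importance sampling over a static index so that the full data never needs to be scanned.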
Pages: 329-338
Number of pages: 10