Maximizing classifier utility when there are data acquisition and modeling costs

Cited by: 41
Authors
Weiss, Gary M. [1]
Tian, Ye [1]
Affiliation
[1] Fordham Univ, Dept Comp & Informat Sci, Bronx, NY 10458 USA
Keywords
data mining; machine learning; induction; decision trees; utility-based data mining; cost-sensitive learning; active learning
DOI
10.1007/s10618-007-0082-x
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Classification is a well-studied problem in data mining. Classification performance was originally gauged almost exclusively using predictive accuracy, but as work in the field progressed, more sophisticated measures of classifier utility that better represented the value of the induced knowledge were introduced. Nonetheless, most work still ignored the cost of acquiring training examples, even though this cost impacts the total utility of the data mining process. In this article we analyze the relationship between the number of acquired training examples and the utility of the data mining process and, given the necessary cost information, we determine the number of training examples that yields the optimum overall performance. We then extend this analysis to include the cost of model induction, measured in terms of the CPU time required to generate the model. While our cost model does not take into account all possible costs, our analysis provides some useful insights and a template for future analyses using more sophisticated cost models. Because our analysis is based on experiments that acquire the full set of training examples, it cannot directly be used to find a classifier with optimal or near-optimal total utility. To address this issue, we introduce two progressive sampling strategies that are empirically shown to produce classifiers with near-optimal total utility.
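The trade-off described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' method: the cost formula (acquisition cost plus misclassification cost), the synthetic power-law learning curve `synthetic_error_rate`, the geometric sampling schedule, and the stop-when-cost-rises rule are all hypothetical, and the abstract does not specify the two progressive sampling strategies the article actually evaluates.

```python
import math

def total_cost(n, error_rate, cost_per_example, cost_per_error, num_scored):
    """Assumed cost model: data acquisition cost plus expected error cost."""
    return n * cost_per_example + error_rate * num_scored * cost_per_error

def synthetic_error_rate(n):
    """Hypothetical learning curve: error falls as a power law in n."""
    return 0.10 + 0.40 / math.sqrt(n)

def progressive_sample(schedule, error_fn, cost_per_example,
                       cost_per_error, num_scored):
    """Acquire data in growing batches; stop once total cost stops improving."""
    best_n, best_cost = None, math.inf
    for n in schedule:
        cost = total_cost(n, error_fn(n), cost_per_example,
                          cost_per_error, num_scored)
        if cost >= best_cost:
            break  # more data no longer pays for itself
        best_n, best_cost = n, cost
    return best_n, best_cost

# Geometric schedule: each step doubles the training-set size.
schedule = [50 * 2 ** i for i in range(10)]
best_n, best_cost = progressive_sample(
    schedule, synthetic_error_rate,
    cost_per_example=1.0, cost_per_error=1.0, num_scored=10_000)
print(best_n, round(best_cost, 2))
```

Under these assumed numbers the loop stops after the first increase in total cost, so it pays for one batch beyond the size it returns. A real procedure would have to weigh that overshoot, and the CPU time of each induction run, against the savings from stopping early.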
Pages: 253-282
Page count: 30