Automatically countering imbalance and its empirical relationship to cost

Cited by: 167
Authors
Chawla, Nitesh V. [1]
Cieslak, David A. [1]
Hall, Lawrence O. [2]
Joshi, Ajay [2]
Affiliations
[1] Univ Notre Dame, Dept Comp Sci & Engn, Notre Dame, IN 46556 USA
[2] Univ S Florida, Dept Comp Sci & Engn, Tampa, FL 33620 USA
Keywords
classification; unbalanced data; cost-sensitive learning;
DOI
10.1007/s10618-008-0087-0
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Learning from imbalanced data sets presents a convoluted problem from both the modeling and cost standpoints. In particular, when a class is of great interest but occurs relatively rarely, such as in cases of fraud, instances of disease, and regions of interest in large-scale simulations, there is a correspondingly high cost for the misclassification of rare events. Under such circumstances, the data set is often re-sampled to generate models with high minority class accuracy. However, the sampling methods face a common, but important, criticism: how to automatically discover the proper amount and type of sampling? To address this problem, we propose a wrapper paradigm that discovers the amount of re-sampling for a data set based on optimizing evaluation functions like the f-measure, Area Under the ROC Curve (AUROC), cost, cost-curves, and the cost dependent f-measure. Our analysis of the wrapper is twofold. First, we report the interaction between different evaluation and wrapper optimization functions. Second, we present a set of results in a cost-sensitive environment, including scenarios of unknown or changing cost matrices. We also compared the performance of the wrapper approach versus cost-sensitive learning methods (MetaCost and the Cost-Sensitive Classifiers) and found the wrapper to outperform the cost-sensitive classifiers in a cost-sensitive environment. Lastly, we obtained the lowest cost per test example compared to any result we are aware of for the KDD-99 Cup intrusion detection data set.
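The wrapper idea in the abstract (search over re-sampling amounts and keep the one that optimizes an evaluation function) can be illustrated with a minimal sketch. This is not the authors' procedure: it assumes a single held-out validation split, random undersampling of the majority class only, the f-measure as the objective, and a toy one-feature threshold learner. All function names (`f_measure`, `undersample`, `fit_threshold`, `wrapper_search`) are hypothetical and chosen for illustration.

```python
import random


def f_measure(y_true, y_pred, beta=1.0):
    """F-beta score for the minority (positive, label 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def undersample(X, y, keep_fraction, rng):
    """Keep every minority example and a random fraction of the majority."""
    majority = [i for i, label in enumerate(y) if label == 0]
    minority = [i for i, label in enumerate(y) if label == 1]
    n_keep = min(len(majority), max(1, round(keep_fraction * len(majority))))
    kept = rng.sample(majority, n_keep) + minority
    return [X[i] for i in kept], [y[i] for i in kept]


def fit_threshold(X, y):
    """Toy learner (an assumption): pick the cut on a 1-D feature that
    maximizes training accuracy, predicting 1 when x >= cut."""
    best_cut, best_acc = None, -1.0
    for cut in sorted(set(X)):
        acc = sum((x >= cut) == (label == 1) for x, label in zip(X, y)) / len(y)
        if acc > best_acc:
            best_cut, best_acc = cut, acc
    return best_cut


def predict(cut, X):
    return [1 if x >= cut else 0 for x in X]


def wrapper_search(X_train, y_train, X_val, y_val, fractions,
                   score=f_measure, seed=0):
    """Try each majority keep-fraction; return (best_fraction, best_score)
    as measured on the validation split."""
    rng = random.Random(seed)
    best_fraction, best_score = None, -1.0
    for fraction in fractions:
        Xs, ys = undersample(X_train, y_train, fraction, rng)
        cut = fit_threshold(Xs, ys)
        s = score(y_val, predict(cut, X_val))
        if s > best_score:
            best_fraction, best_score = fraction, s
    return best_fraction, best_score
```

Passing a different `score` callable (for example, one that returns negated misclassification cost) would change the objective without touching the search loop, which is the sense in which the wrapper is independent of the evaluation function.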
Pages: 225-252
Number of pages: 28