Automatically countering imbalance and its empirical relationship to cost

Cited by: 167
Authors
Chawla, Nitesh V. [1 ]
Cieslak, David A. [1 ]
Hall, Lawrence O. [2 ]
Joshi, Ajay [2 ]
Affiliations
[1] Univ Notre Dame, Dept Comp Sci & Engn, Notre Dame, IN 46556 USA
[2] Univ S Florida, Dept Comp Sci & Engn, Tampa, FL 33620 USA
Keywords
classification; unbalanced data; cost-sensitive learning
DOI
10.1007/s10618-008-0087-0
Chinese Library Classification
TP18 [Theory of Artificial Intelligence]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Learning from imbalanced data sets presents a challenging problem from both the modeling and cost standpoints. In particular, when a class is of great interest but occurs relatively rarely, such as in cases of fraud, instances of disease, and regions of interest in large-scale simulations, there is a correspondingly high cost for the misclassification of rare events. Under such circumstances, the data set is often re-sampled to generate models with high minority class accuracy. However, the sampling methods face a common but important criticism: how can the proper amount and type of sampling be discovered automatically? To address this problem, we propose a wrapper paradigm that discovers the amount of re-sampling for a data set by optimizing evaluation functions such as the f-measure, Area Under the ROC Curve (AUROC), cost, cost curves, and the cost-dependent f-measure. Our analysis of the wrapper is twofold. First, we report the interaction between different evaluation and wrapper optimization functions. Second, we present a set of results in a cost-sensitive environment, including scenarios of unknown or changing cost matrices. We also compared the wrapper approach against cost-sensitive learning methods (MetaCost and the Cost-Sensitive Classifiers) and found the wrapper to outperform the cost-sensitive classifiers in a cost-sensitive environment. Lastly, we obtained the lowest cost per test example compared to any result we are aware of for the KDD-99 Cup intrusion detection data set.
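The wrapper idea in the abstract can be sketched as a simple search over re-sampling amounts, keeping the amount that maximizes a chosen evaluation function (here the f-measure) on held-out data. The sketch below is illustrative only: it substitutes a toy one-dimensional threshold learner and plain duplication oversampling for the paper's actual learners and sampling methods (e.g. SMOTE), and all names and data are hypothetical.

```python
# Illustrative wrapper: choose the oversampling amount that maximizes
# the f-measure on a validation set. Toy learner and toy data, not the
# paper's actual method.

def f_measure(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def oversample(X, y, rate):
    """Append `rate` extra copies of every minority (class 1) example."""
    Xo, yo = list(X), list(y)
    for x, t in zip(X, y):
        if t == 1:
            Xo.extend([x] * rate)
            yo.extend([1] * rate)
    return Xo, yo

def train_threshold(X, y):
    """Toy learner: pick the threshold (predict 1 iff x >= thr) with the
    fewest training errors; duplicated minority points sway its choice."""
    candidates = sorted(set(X)) + [max(X) + 1.0]  # last = predict all 0
    best_thr, best_err = None, float("inf")
    for thr in candidates:
        err = sum((x >= thr) != (t == 1) for x, t in zip(X, y))
        if err < best_err:
            best_thr, best_err = thr, err
    return lambda x: 1 if x >= best_thr else 0

def wrapper_select_rate(X_tr, y_tr, X_val, y_val, rates):
    """The wrapper loop: evaluate each re-sampling amount, keep the best."""
    best_rate, best_f = rates[0], -1.0
    for r in rates:
        Xo, yo = oversample(X_tr, y_tr, r)
        clf = train_threshold(Xo, yo)
        f = f_measure(y_val, [clf(x) for x in X_val])
        if f > best_f:
            best_rate, best_f = r, f
    return best_rate, best_f

# Tiny imbalanced example: 12 majority vs. 3 minority training points.
X_tr = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 8.2, 9.2, 6.5, 7.5, 8.5]
y_tr = [0] * 12 + [1] * 3
X_val, y_val = [0, 2, 4, 6, 8, 7, 9], [0, 0, 0, 0, 0, 1, 1]

best_rate, best_f = wrapper_select_rate(X_tr, y_tr, X_val, y_val, [0, 1, 2, 3])
```

In this toy run, a rate of 0 (no re-sampling) produces a classifier that predicts only the majority class, so its f-measure is zero; the wrapper's chosen rate recovers the minority class. The paper performs the same kind of search with real learners, sampling schemes, and additional objectives (AUROC, cost, cost curves).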
Pages: 225-252 (28 pages)