Automatically countering imbalance and its empirical relationship to cost

Cited by: 167
Authors
Chawla, Nitesh V. [1 ]
Cieslak, David A. [1 ]
Hall, Lawrence O. [2 ]
Joshi, Ajay [2 ]
Affiliations
[1] Univ Notre Dame, Dept Comp Sci & Engn, Notre Dame, IN 46556 USA
[2] Univ S Florida, Dept Comp Sci & Engn, Tampa, FL 33620 USA
Keywords
classification; unbalanced data; cost-sensitive learning
DOI
10.1007/s10618-008-0087-0
Chinese Library Classification
TP18 [Theory of Artificial Intelligence]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Learning from imbalanced data sets presents a challenging problem from both the modeling and cost standpoints. In particular, when a class is of great interest but occurs relatively rarely, such as in cases of fraud, instances of disease, and regions of interest in large-scale simulations, there is a correspondingly high cost for the misclassification of rare events. Under such circumstances, the data set is often re-sampled to generate models with high minority class accuracy. However, the sampling methods face a common but important criticism: how can the proper amount and type of sampling be discovered automatically? To address this problem, we propose a wrapper paradigm that discovers the amount of re-sampling for a data set by optimizing evaluation functions such as the f-measure, Area Under the ROC Curve (AUROC), cost, cost curves, and the cost-dependent f-measure. Our analysis of the wrapper is twofold. First, we report the interaction between different evaluation and wrapper optimization functions. Second, we present a set of results in a cost-sensitive environment, including scenarios of unknown or changing cost matrices. We also compared the wrapper approach against cost-sensitive learning methods (MetaCost and the Cost-Sensitive Classifiers) and found the wrapper to outperform the cost-sensitive classifiers in a cost-sensitive environment. Lastly, we obtained the lowest cost per test example compared to any result we are aware of for the KDD-99 Cup intrusion detection data set.
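The wrapper idea in the abstract can be sketched as a simple search over re-sampling amounts, keeping the amount that maximizes a chosen evaluation function (here the f-measure) on held-out data. The sketch below is illustrative only: it substitutes a toy one-dimensional threshold learner and plain duplication oversampling for the paper's actual learners and sampling methods (e.g. SMOTE), and all names and data are hypothetical.

```python
# Illustrative wrapper: choose the oversampling amount that maximizes
# the f-measure on a validation set. Toy learner and toy data, not the
# paper's actual method.

def f_measure(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def oversample(X, y, rate):
    """Append `rate` extra copies of every minority (class 1) example."""
    Xo, yo = list(X), list(y)
    for x, t in zip(X, y):
        if t == 1:
            Xo.extend([x] * rate)
            yo.extend([1] * rate)
    return Xo, yo

def train_threshold(X, y):
    """Toy learner: pick the threshold (predict 1 iff x >= thr) with the
    fewest training errors; duplicated minority points sway its choice."""
    candidates = sorted(set(X)) + [max(X) + 1.0]  # last = predict all 0
    best_thr, best_err = None, float("inf")
    for thr in candidates:
        err = sum((x >= thr) != (t == 1) for x, t in zip(X, y))
        if err < best_err:
            best_thr, best_err = thr, err
    return lambda x: 1 if x >= best_thr else 0

def wrapper_select_rate(X_tr, y_tr, X_val, y_val, rates):
    """The wrapper loop: evaluate each re-sampling amount, keep the best."""
    best_rate, best_f = rates[0], -1.0
    for r in rates:
        Xo, yo = oversample(X_tr, y_tr, r)
        clf = train_threshold(Xo, yo)
        f = f_measure(y_val, [clf(x) for x in X_val])
        if f > best_f:
            best_rate, best_f = r, f
    return best_rate, best_f

# Tiny imbalanced example: 12 majority vs. 3 minority training points.
X_tr = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 8.2, 9.2, 6.5, 7.5, 8.5]
y_tr = [0] * 12 + [1] * 3
X_val, y_val = [0, 2, 4, 6, 8, 7, 9], [0, 0, 0, 0, 0, 1, 1]

best_rate, best_f = wrapper_select_rate(X_tr, y_tr, X_val, y_val, [0, 1, 2, 3])
```

In this toy run, a rate of 0 (no re-sampling) produces a classifier that predicts only the majority class, so its f-measure is zero; the wrapper's chosen rate recovers the minority class. The paper performs the same kind of search with real learners, sampling schemes, and additional objectives (AUROC, cost, cost curves).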
Pages: 225-252 (28 pages)