Automatically countering imbalance and its empirical relationship to cost

Cited by: 169

Authors:

Chawla, Nitesh V. ^{[1]}

Cieslak, David A. ^{[1]}

Hall, Lawrence O. ^{[2]}

Joshi, Ajay ^{[2]}

Affiliations:

[1] Univ Notre Dame, Dept Comp Sci & Engn, Notre Dame, IN 46556 USA

[2] Univ S Florida, Dept Comp Sci & Engn, Tampa, FL 33620 USA

Source:

DATA MINING AND KNOWLEDGE DISCOVERY | 2008 / Volume 17 / Issue 02

Keywords:

classification; unbalanced data; cost-sensitive learning;

DOI:

10.1007/s10618-008-0087-0

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Learning from imbalanced data sets presents a convoluted problem both from the modeling and cost standpoints. In particular, when a class is of great interest but occurs relatively rarely, such as in cases of fraud, instances of disease, and regions of interest in large-scale simulations, there is a correspondingly high cost for the misclassification of rare events. Under such circumstances, the data set is often re-sampled to generate models with high minority class accuracy. However, the sampling methods face a common, but important, criticism: how to automatically discover the proper amount and type of sampling? To address this problem, we propose a wrapper paradigm that discovers the amount of re-sampling for a data set based on optimizing evaluation functions like the f-measure, Area Under the ROC Curve (AUROC), cost, cost-curves, and the cost dependent f-measure. Our analysis of the wrapper is twofold. First, we report the interaction between different evaluation and wrapper optimization functions. Second, we present a set of results in a cost-sensitive environment, including scenarios of unknown or changing cost matrices. We also compared the performance of the wrapper approach versus cost-sensitive learning methods (MetaCost and the Cost-Sensitive Classifiers) and found the wrapper to outperform the cost-sensitive classifiers in a cost-sensitive environment. Lastly, we obtained the lowest cost per test example compared to any result we are aware of for the KDD-99 Cup intrusion detection data set.
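The wrapper idea in the abstract, choosing the amount of re-sampling by optimizing an evaluation function, can be illustrated with a minimal sketch. This is not the authors' procedure: it assumes a plain grid search over majority-class undersampling fractions, a single hold-out validation set, the f-measure as the objective, and a toy one-dimensional threshold learner. All function names (`f_measure`, `undersample`, `fit_threshold`, `wrapper_search`) are hypothetical and introduced only for illustration.

```python
import random


def f_measure(y_true, y_pred, positive=1):
    """F-measure (harmonic mean of precision and recall) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def undersample(X, y, keep_fraction, rng, majority=0):
    """Keep every minority example and a random fraction of the majority class."""
    maj = [i for i, label in enumerate(y) if label == majority]
    mino = [i for i, label in enumerate(y) if label != majority]
    n_keep = min(len(maj), max(1, round(keep_fraction * len(maj))))
    kept = rng.sample(maj, n_keep) + mino
    return [X[i] for i in kept], [y[i] for i in kept]


def fit_threshold(X, y):
    """Toy learner (assumption): predict 1 when x >= t, with t minimizing training errors."""
    best_t, best_err = None, None
    for t in sorted(set(X)):
        err = sum((x >= t) != (label == 1) for x, label in zip(X, y))
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t


def predict(threshold, X):
    return [1 if x >= threshold else 0 for x in X]


def wrapper_search(X_train, y_train, X_val, y_val, fractions, seed=0):
    """Return (fraction, score) for the undersampling level with the best validation f-measure."""
    rng = random.Random(seed)
    best_fraction, best_score = None, -1.0
    for fraction in fractions:
        Xs, ys = undersample(X_train, y_train, fraction, rng)
        threshold = fit_threshold(Xs, ys)
        score = f_measure(y_val, predict(threshold, X_val))
        if score > best_score:
            best_fraction, best_score = fraction, score
    return best_fraction, best_score
```

Swapping `f_measure` for another evaluation function (for example, a total misclassification cost to be minimized) changes what the search optimizes, which is the kind of interaction between evaluation and optimization functions the abstract refers to.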


Pages: 225-252

Page count: 28
