Learning from Imbalanced Data

Cited by: 6280

Authors:

He, Haibo [1]

Garcia, Edwardo A. [1]

Affiliations:

[1] Stevens Inst Technol, Dept Elect & Comp Engn, Hoboken, NJ 07030 USA

Source:

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING | 2009 / Vol. 21 / No. 09

Keywords:

Imbalanced learning; classification; sampling methods; cost-sensitive learning; kernel-based learning; active learning; assessment metrics; SUPPORT VECTOR MACHINES; CLASSIFICATION; RECOGNITION; SVM; ONLINE;

DOI:

10.1109/TKDE.2008.239

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

With the continuous expansion of data availability in many large-scale, complex, and networked systems, such as surveillance, security, Internet, and finance, it becomes critical to advance the fundamental understanding of knowledge discovery and analysis from raw data to support decision-making processes. Although existing knowledge discovery and data engineering techniques have shown great success in many real-world applications, the problem of learning from imbalanced data (the imbalanced learning problem) is a relatively new challenge that has attracted growing attention from both academia and industry. The imbalanced learning problem is concerned with the performance of learning algorithms in the presence of underrepresented data and severe class distribution skews. Due to the inherent complex characteristics of imbalanced data sets, learning from such data requires new understandings, principles, algorithms, and tools to transform vast amounts of raw data efficiently into information and knowledge representation. In this paper, we provide a comprehensive review of the development of research in learning from imbalanced data. Our focus is to provide a critical review of the nature of the problem, the state-of-the-art technologies, and the current assessment metrics used to evaluate learning performance under the imbalanced learning scenario. Furthermore, in order to stimulate future research in this field, we also highlight the major opportunities and challenges, as well as potential important research directions for learning from imbalanced data.

Pages: 1263-1284

Page count: 22
