Preprocessing noisy imbalanced datasets using SMOTE enhanced with fuzzy rough prototype selection

Cited by: 64

Authors:

Verbiest, Nele [1]

Ramentol, Enislay [2]

Cornelis, Chris [1,3]

Herrera, Francisco [3,4]

Affiliations:

[1] Univ Ghent, Dept Appl Math & Comp Sci, B-9000 Ghent, Belgium

[2] Univ Camaguey, Dept Comp Sci, Camaguey, Cuba

[3] Univ Granada, Dept Comp Sci & AI, E-18071 Granada, Spain

[4] King Abdulaziz Univ, Fac Comp & Informat Technol, Jeddah 21413, Saudi Arabia

Source:

APPLIED SOFT COMPUTING | 2014 / Vol. 22

Keywords:

Imbalanced classification; SMOTE; Prototype selection; Fuzzy rough set theory; STATISTICAL COMPARISONS; CLASSIFICATION; CLASSIFIERS; SETS;

DOI:

10.1016/j.asoc.2014.05.023

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The Synthetic Minority Over-sampling TEchnique (SMOTE) is a widely used technique to balance imbalanced data. In this paper we focus on improving SMOTE in the presence of class noise. Many improvements of SMOTE have been proposed, most of which clean or improve the data after applying SMOTE. Our approach differs in that it cleans the data before applying SMOTE, so that the generated instances are of better quality. After applying SMOTE we also carry out data cleaning, so that instances (original or introduced by SMOTE) that fit badly in the new dataset are removed as well. To this end we propose two prototype selection techniques, both based on fuzzy rough set theory. The first fuzzy rough prototype selection algorithm removes noisy instances from the imbalanced dataset; the second cleans the data generated by SMOTE. An experimental evaluation shows that our method improves on existing preprocessing methods for imbalanced classification, especially in the presence of noise. © 2014 Elsevier B.V. All rights reserved.
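The clean, oversample, clean order described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names (`knn_indices`, `enn_filter`, `smote`, `clean_smote_clean`) are hypothetical, and the cleaning step is a plain edited-nearest-neighbour filter standing in for the paper's fuzzy rough prototype selection, which this record does not specify. It shows only the ordering of the steps, not the authors' algorithms.

```python
import numpy as np


def knn_indices(X, k):
    """Indices of the k nearest neighbours of each row of X (self excluded)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # a point is never its own neighbour
    return np.argsort(d, axis=1)[:, :k]


def enn_filter(X, y, k=3):
    """Stand-in cleaning step (NOT fuzzy rough prototype selection):
    keep an instance only if most of its k neighbours share its label."""
    nn = knn_indices(X, k)
    agree = (y[nn] == y[:, None]).sum(axis=1)
    keep = 2 * agree > k
    return X[keep], y[keep]


def smote(X, y, minority, k=5, seed=0):
    """Basic SMOTE: interpolate between minority points and their
    minority-class neighbours until both classes have equal size."""
    rng = np.random.default_rng(seed)
    X_min = X[y == minority]
    n_new = int((y != minority).sum()) - len(X_min)
    if n_new <= 0 or len(X_min) < 2:
        return X, y  # nothing to generate, or too few points to interpolate
    k = min(k, len(X_min) - 1)
    nn = knn_indices(X_min, k)
    base = rng.integers(0, len(X_min), size=n_new)
    neigh = nn[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # position along the segment base -> neigh
    synth = X_min[base] + gap * (X_min[neigh] - X_min[base])
    return np.vstack([X, synth]), np.concatenate([y, np.full(n_new, minority)])


def clean_smote_clean(X, y, minority, k_clean=3, k_smote=5, seed=0):
    """Clean first, then oversample, then clean the enlarged dataset."""
    X, y = enn_filter(X, y, k_clean)             # remove noise before SMOTE
    X, y = smote(X, y, minority, k_smote, seed)  # balance the classes
    return enn_filter(X, y, k_clean)             # remove badly fitting instances
```

Cleaning before oversampling matters because SMOTE interpolates from every minority-labelled point: a mislabelled instance left in the data would seed synthetic instances inside the other class.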


Pages: 511-517

Page count: 7

