Preprocessing noisy imbalanced datasets using SMOTE enhanced with fuzzy rough prototype selection

Cited by: 64
Authors
Verbiest, Nele [1]
Ramentol, Enislay [2]
Cornelis, Chris [1, 3]
Herrera, Francisco [3, 4]
Affiliations
[1] Univ Ghent, Dept Appl Math & Comp Sci, B-9000 Ghent, Belgium
[2] Univ Camaguey, Dept Comp Sci, Camaguey, Cuba
[3] Univ Granada, Dept Comp Sci & AI, E-18071 Granada, Spain
[4] King Abdulaziz Univ, Fac Comp & Informat Technol, Jeddah 21413, Saudi Arabia
Keywords
Imbalanced classification; SMOTE; Prototype selection; Fuzzy rough set theory; Statistical comparisons; Classification; Classifiers; Sets
DOI
10.1016/j.asoc.2014.05.023
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The Synthetic Minority Over-sampling TEchnique (SMOTE) is a widely used technique for balancing imbalanced data. In this paper we focus on improving SMOTE in the presence of class noise. Many improvements of SMOTE have been proposed, most of which clean or improve the data after applying SMOTE. Our approach differs in that it cleans the data before applying SMOTE, so that the generated instances are of better quality. After applying SMOTE we carry out a second round of data cleaning, so that instances (original or introduced by SMOTE) that fit the new dataset badly are also removed. To this end we propose two prototype selection techniques, both based on fuzzy rough set theory: the first removes noisy instances from the imbalanced dataset, and the second cleans the data generated by SMOTE. An experimental evaluation shows that our method improves on existing preprocessing methods for imbalanced classification, especially in the presence of noise. (C) 2014 Elsevier B.V. All rights reserved.
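
The clean, oversample, clean pipeline described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not give the details of the two fuzzy rough prototype selection algorithms, so the cleaning step here is a simple k-nearest-neighbour majority-vote filter used as a stand-in, and the function names (`knn_indices`, `nn_filter`, `smote`, `clean_smote_clean`) are hypothetical. The oversampling step follows the basic SMOTE idea of interpolating between a minority instance and one of its minority-class neighbours, for a two-class problem.

```python
import math
import random


def knn_indices(i, X, k, candidates):
    """Indices of the (up to) k candidates nearest to X[i], excluding i itself."""
    others = [j for j in candidates if j != i]
    others.sort(key=lambda j: math.dist(X[i], X[j]))
    return others[:k]


def nn_filter(X, y, k=3):
    """Stand-in cleaner (NOT the paper's fuzzy rough method): drop every
    instance whose label is shared by fewer than half of its k nearest neighbours."""
    keep = []
    all_idx = range(len(X))
    for i in all_idx:
        nbrs = knn_indices(i, X, k, all_idx)
        agree = sum(1 for j in nbrs if y[j] == y[i])
        if 2 * agree >= len(nbrs):
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


def smote(X, y, minority, k=3, rng=None):
    """Basic two-class SMOTE: add synthetic minority points by interpolating
    between a minority point and a random minority neighbour until balanced."""
    rng = rng or random.Random(0)
    min_idx = [i for i, label in enumerate(y) if label == minority]
    n_new = len(y) - 2 * len(min_idx)  # majority count minus minority count
    X_out, y_out = list(X), list(y)
    for _ in range(max(n_new, 0)):
        i = rng.choice(min_idx)
        nbrs = knn_indices(i, X, k, min_idx)
        if not nbrs:  # a single minority point cannot be interpolated
            break
        j = rng.choice(nbrs)
        gap = rng.random()
        X_out.append(tuple(a + gap * (b - a) for a, b in zip(X[i], X[j])))
        y_out.append(minority)
    return X_out, y_out


def clean_smote_clean(X, y, minority, k=3, seed=0):
    """Clean the original data, oversample the minority class, then clean again."""
    X1, y1 = nn_filter(X, y, k)
    X2, y2 = smote(X1, y1, minority, k, random.Random(seed))
    return nn_filter(X2, y2, k)
```

Instances are expected as tuples of floats with one label per instance. Cleaning before oversampling matters because SMOTE interpolates from every minority instance it is given, so a mislabelled point that survives would seed further synthetic points in the wrong region. A plain neighbour-vote filter like this one can also discard genuine minority instances on strongly imbalanced data, which is one reason it should be read as a placeholder only.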
Pages: 511-517
Page count: 7