Kernel density estimation based sampling for imbalanced class distribution

Cited by: 96
Authors
Kamalov, Firuz [1]
Affiliations
[1] Canadian Univ Dubai, Dept Elect Engn, Dubai, U Arab Emirates
Keywords
Kernel; KDE; Imbalanced data; Class imbalance; Sampling; Oversampling; FEATURE-SELECTION; CHALLENGES; SMOTE;
DOI
10.1016/j.ins.2019.10.017
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Imbalanced response variable distribution is a common occurrence in data science. In fields such as fraud detection, medical diagnostics, system intrusion detection, and many others where abnormal behavior is rarely observed, the data under study often feature a disproportionate target class distribution. One common way to combat class imbalance is to resample the minority class to achieve a more balanced distribution. In this paper, we investigate the performance of a sampling method based on kernel density estimation (KDE). We believe that KDE offers a more natural way to generate new instances of the minority class and is less prone to overfitting than other standard sampling techniques. It is based on a well-established theory of nonparametric statistical estimation. Numerical experiments show that KDE can outperform other sampling techniques on a range of real-life datasets as measured by F1-score and G-mean. The results remain consistent across a number of classification algorithms used in the experiments. Furthermore, the proposed method outperforms the benchmark methods regardless of the class distribution ratio. We conclude, based on the solid theoretical foundation and strong experimental results, that the proposed method would be a valuable tool in problems involving imbalanced class distribution. © 2019 Elsevier Inc. All rights reserved.
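The idea in the abstract can be sketched roughly as follows: fit a density estimate to the minority class only, then draw synthetic minority points from that estimate until the classes are balanced. This is a minimal illustration, not the paper's implementation. The abstract does not state the kernel or the bandwidth rule, so the Gaussian kernel and the default Scott's-rule bandwidth of SciPy's `gaussian_kde` are assumptions here, as are the toy data and the helper names `kde_oversample` and `g_mean`. The G-mean helper uses the usual definition, the geometric mean of sensitivity and specificity.

```python
import numpy as np
from scipy.stats import gaussian_kde


def kde_oversample(X_min, n_new, seed=0):
    """Draw n_new synthetic minority points from a Gaussian KDE fit to X_min.

    X_min has shape (n_samples, n_features); gaussian_kde expects
    (n_features, n_samples), hence the transposes.
    Assumption: Gaussian kernel with SciPy's default (Scott's rule) bandwidth.
    """
    kde = gaussian_kde(X_min.T)
    return kde.resample(n_new, seed=seed).T


def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity for binary labels {0, 1}."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    sensitivity = np.mean(y_pred[y_true == 1] == 1)  # recall on the positive class
    specificity = np.mean(y_pred[y_true == 0] == 0)  # recall on the negative class
    return float(np.sqrt(sensitivity * specificity))


# Toy data (hypothetical): 200 majority points and 20 minority points in 2-D.
rng = np.random.default_rng(0)
X_major = rng.normal(0.0, 1.0, size=(200, 2))
X_minor = rng.normal(3.0, 0.5, size=(20, 2))

# Generate enough synthetic minority points to match the majority class size.
X_new = kde_oversample(X_minor, n_new=len(X_major) - len(X_minor))
X_minor_balanced = np.vstack([X_minor, X_new])
print(X_minor_balanced.shape)
```

In practice the oversampling would be applied to the training split only, so that synthetic points never leak into the data used to compute F1-score or G-mean.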
Pages: 1192-1201
Page count: 10