Surrounding neighborhood-based SMOTE for learning from imbalanced data sets

Cited by: 43
Authors
García, V. [1]
Sánchez, J.S. [1]
Martín-Félez, R. [1]
Mollineda, R.A. [1]
Affiliation
[1] Institute of New Imaging Technologies, Department of Computer Languages and Systems, Universitat Jaume I, 12071 Castellón de la Plana, Av. Vicent Sos Baynat s/n
Keywords
Gabriel graph; Imbalance; Nearest centroid neighborhood; Over-sampling; Relative neighborhood graph; SMOTE; Surrounding neighborhood
DOI
10.1007/s13748-012-0027-5
Abstract
Many traditional approaches to pattern classification assume that the problem classes share similar prior probabilities. However, in many real-life applications, this assumption is grossly violated. Often, the ratios of prior probabilities between classes are extremely skewed. This situation is known as the class imbalance problem. One of the strategies to tackle this problem consists of balancing the classes by resampling the original data set. The SMOTE algorithm is probably the most popular technique to increase the size of the minority class by generating synthetic instances. From the idea of the original SMOTE, we here propose the use of three approaches to surrounding neighborhood with the aim of generating artificial minority instances, but taking into account both the proximity and the spatial distribution of the examples. Experiments over a large collection of databases and using three different classifiers demonstrate that the new surrounding neighborhood-based SMOTE procedures significantly outperform other existing over-sampling algorithms. © 2012 Springer-Verlag Berlin Heidelberg.
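For orientation, the interpolation step of the original SMOTE that the abstract builds on can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `smote`, its parameters, and the random choice of seed instances are assumptions made here for brevity (the original algorithm generates a fixed number of synthetic points per minority instance). The surrounding-neighborhood variants proposed in the paper would replace the k-nearest-neighbor selection step with a Gabriel graph, relative neighborhood graph, or nearest centroid neighborhood; that substitution is not shown.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Sketch of SMOTE: interpolate between a minority point and one of its
    k nearest minority neighbors. Names and defaults are illustrative."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        # Pick a seed instance at random (a simplification of the original
        # per-instance loop).
        i = rng.randrange(len(minority))
        x = minority[i]
        # k nearest minority neighbors of x by Euclidean distance,
        # excluding x itself.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(x, p))[:k]
        z = rng.choice(neighbors)
        # New point lies on the segment between x and z.
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, z)))
    return synthetic


if __name__ == "__main__":
    minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
    for point in smote(minority_points, n_synthetic=3, k=2):
        print(point)
```

Because each synthetic point lies on a segment between two existing minority instances, the choice of neighborhood determines where new instances can appear, which is the step the surrounding-neighborhood approaches change.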
Pages: 347-362
Page count: 15