Difficulty Factors and Preprocessing in Imbalanced Data Sets: An Experimental Study on Artificial Data

Cited by: 17
Authors
Wojciechowski S. [1]
Wilk S. [1]
Affiliation
[1] Institute of Computing Science, Poznan University of Technology, Piotrowo 2, Poznan
Source
Walter de Gruyter GmbH / Vol. 42 / 2017
Keywords
difficulty factors; imbalanced data; learning and classification; preprocessing methods
DOI
10.1515/fcds-2017-0007
Abstract
In this paper we describe the results of an experimental study in which we checked the impact of various difficulty factors in imbalanced data sets on the performance of selected classifiers, applied alone or combined with several preprocessing methods. We used artificial data sets in order to systematically check factors such as dimensionality, class imbalance ratio, and the distribution of specific types of examples (safe, borderline, rare and outliers) in the minority class. The results revealed that the last of these factors was the most critical one and that it exacerbated the other factors (in particular class imbalance). The best classification performance was demonstrated by non-symbolic classifiers, in particular by k-NN classifiers (with 1 or 3 neighbors - 1NN and 3NN, respectively) and by SVM. Moreover, these classifiers benefited from different preprocessing methods - SVM and 1NN worked best with undersampling, while oversampling was more beneficial for 3NN. © by Szymon Wilk 2017.
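The two preprocessing families named in the abstract can be illustrated with a minimal sketch. It assumes plain random undersampling and random oversampling on a two-class problem; the abstract does not say which specific methods the study used, so the function names, parameters, and toy data below are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Return the majority-class count divided by the minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def random_resample(examples, labels, minority, mode, seed=0):
    """Balance a two-class data set by random under- or oversampling.

    Assumes the class labelled `minority` is the smaller one.
    mode="under" keeps a random subset of the majority examples;
    mode="over" adds random duplicates of minority examples.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    if mode == "under":
        majority_idx = rng.sample(majority_idx, len(minority_idx))
    elif mode == "over":
        extra = len(majority_idx) - len(minority_idx)
        minority_idx = minority_idx + rng.choices(minority_idx, k=extra)
    else:
        raise ValueError("mode must be 'under' or 'over'")
    keep = sorted(minority_idx + majority_idx)
    return [examples[i] for i in keep], [labels[i] for i in keep]


# Hypothetical toy data: 90 majority (label 0) and 10 minority (label 1) points.
toy_rng = random.Random(42)
X = [(toy_rng.gauss(0, 1), toy_rng.gauss(0, 1)) for _ in range(100)]
y = [0] * 90 + [1] * 10

X_under, y_under = random_resample(X, y, minority=1, mode="under")
X_over, y_over = random_resample(X, y, minority=1, mode="over")
```

Undersampling shrinks the data set to the size of the minority class, whereas oversampling grows it to twice the size of the majority class; either way the class imbalance ratio of the resampled data becomes 1.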
Pages: 149-176
Number of pages: 27