SOUL: Scala Oversampling and Undersampling Library for imbalance classification

Cited by: 1
Authors
Rodriguez, Nestor [1]
Lopez, David [1]
Fernandez, Alberto [1]
Garcia, Salvador [1]
Herrera, Francisco [1]
Affiliations
[1] Univ Granada, DaSCI Andalusian Inst Data Sci & Computat Intelli, Granada, Spain
Keywords
Oversampling; Undersampling; Scala; Imbalanced classification; SMOTE; PERFORMANCE; CHALLENGES; SELECTION; SPARK
DOI
10.1016/j.softx.2021.100767
Chinese Library Classification (CLC) number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
Advances in technology and computation have promoted the global adoption of Data Science, a field devoted to extracting significant knowledge from large amounts of information by applying Artificial Intelligence and Machine Learning tools. Among the different tasks within Data Science, classification is probably the most widespread. In classification, we often face datasets in which the number of instances for one of the classes is much lower than for the remaining ones. This issue is known as the imbalanced classification problem, and it is mainly related to the need to boost the recognition of minority class examples. Despite the large number of solutions proposed in the specialized literature to address imbalanced classification, there is a lack of open-source software that compiles the most relevant ones in an easy-to-use and scalable way. In this paper, we present a novel software approach named SOUL, which stands for Scala Oversampling and Undersampling Library for imbalanced classification. The main capabilities of this new library include a large number of different data preprocessing techniques, efficient execution of these approaches, and a graphical environment to contrast the output of the different preprocessing solutions. © 2021 The Authors. Published by Elsevier B.V.
Pages: 8