Multi-class imbalanced big data classification on Spark

Cited by: 61
Authors
Sleeman, William C. [1]
Krawczyk, Bartosz [1]
Affiliations
[1] Virginia Commonwealth Univ, Dept Comp Sci, Richmond, VA 23284 USA
Keywords
Machine learning; Big data; Imbalanced data classification; Multi-class imbalance; Spark; MapReduce; DECISION TREE; MAPREDUCE; SELECTION; ENSEMBLE; INFORMATION; ALGORITHMS; IMPROVE; SMOTE
DOI
10.1016/j.knosys.2020.106598
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Despite more than two decades of progress, learning from imbalanced data is still considered one of the contemporary challenges in machine learning. This has been further complicated by the advent of the big data era, where popular algorithms for alleviating the impact of class skew are no longer feasible because of the volume of the datasets. Additionally, most existing algorithms focus on binary imbalanced problems, where the majority and minority classes are well-defined. Multi-class imbalanced data poses further challenges, as the relationships between classes are much more complex and simple decomposition into a number of binary problems leads to a significant loss of information. In this paper, we propose the first compound framework for dealing with multi-class big data problems, addressing at the same time the existence of multiple classes and high volumes of data. We propose to analyze the instance-level difficulties in each class in order to understand what causes learning difficulties. We embed this information in popular resampling algorithms, which allows for informative balancing of multiple classes. We propose an efficient implementation of the discussed algorithm on Apache Spark, including a novel version of SMOTE that overcomes the spatial limitations of its predecessor in distributed environments. An extensive experimental study shows that using instance-level information significantly improves learning from multi-class imbalanced big data. Our framework can be downloaded from https://github.com/fsleeman/minority-type-imbalanced. (C) 2020 Elsevier B.V. All rights reserved.
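The abstract does not say how instance-level difficulty is measured or how it is fed into resampling. The following is a minimal single-machine sketch under stated assumptions, not the authors' Spark implementation. It assumes a neighbourhood-based labelling that is common in the imbalanced-learning literature (safe, borderline, rare, outlier, decided by how many of an instance's k nearest neighbours share its class) and a SMOTE-style interpolation that draws seed instances only from selected types. The function names, thresholds, and toy data are all hypothetical.

```python
import math
import random

def k_nearest(points, idx, k):
    """Indices of the k nearest neighbours of points[idx] (brute force, Euclidean)."""
    dists = sorted((math.dist(points[idx], p), j)
                   for j, p in enumerate(points) if j != idx)
    return [j for _, j in dists[:k]]

def instance_type(points, labels, idx, k=5):
    """Label one instance by the share of its k neighbours with the same class.

    Assumed thresholds: >= 4/5 safe, >= 2/5 borderline, > 0 rare, 0 outlier.
    """
    same = sum(1 for j in k_nearest(points, idx, k) if labels[j] == labels[idx])
    if same * 5 >= 4 * k:
        return "safe"
    if same * 5 >= 2 * k:
        return "borderline"
    if same > 0:
        return "rare"
    return "outlier"

def oversample(points, labels, target, n_new, allowed_types, k=5, seed=0):
    """SMOTE-style interpolation whose seeds are limited to the allowed types."""
    rng = random.Random(seed)
    same_class = [i for i, y in enumerate(labels) if y == target]
    seeds = [i for i in same_class
             if instance_type(points, labels, i, k) in allowed_types]
    synthetic = []
    if not seeds or len(same_class) < 2:
        return synthetic
    for _ in range(n_new):
        i = rng.choice(seeds)
        # k nearest neighbours of the seed within its own class
        candidates = sorted((j for j in same_class if j != i),
                            key=lambda j: math.dist(points[i], points[j]))[:k]
        j = rng.choice(candidates)
        t = rng.random()  # position along the segment from seed to neighbour
        synthetic.append(tuple(a + t * (b - a)
                               for a, b in zip(points[i], points[j])))
    return synthetic

# Toy data (hypothetical): a 3x3 grid of class "a", a compact cluster of
# class "b", and one "b" instance placed inside the "a" grid.
points = [(x, y) for x in range(3) for y in range(3)]
labels = ["a"] * len(points)
points += [(10, 10), (10, 11), (11, 10), (11, 11), (10, 12), (12, 10), (1.5, 1.5)]
labels += ["b"] * 7

new_points = oversample(points, labels, "b", n_new=4, allowed_types=("safe",))
```

In this toy data the clustered "b" instances come out as safe and the isolated one as an outlier, so restricting `allowed_types` changes which region of the feature space receives synthetic points. The brute-force neighbour search is quadratic in the number of instances, so a distributed setting would need a different neighbour-search strategy.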
Pages: 15