Multi-class imbalanced big data classification on Spark

Cited by: 61
Authors
Sleeman, William C. [1]
Krawczyk, Bartosz [1]
Affiliations
[1] Virginia Commonwealth Univ, Dept Comp Sci, Richmond, VA 23284 USA
Keywords
Machine learning; Big data; Imbalanced data classification; Multi-class imbalance; Spark; MapReduce; DECISION TREE; MAPREDUCE; SELECTION; ENSEMBLE; INFORMATION; ALGORITHMS; IMPROVE; SMOTE
DOI
10.1016/j.knosys.2020.106598
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Despite more than two decades of progress, learning from imbalanced data is still considered one of the contemporary challenges in machine learning. This has been further complicated by the advent of the big data era, where popular algorithms dedicated to alleviating the impact of class skew are no longer feasible due to the volume of datasets. Additionally, most existing algorithms focus on binary imbalanced problems, where the majority and minority classes are well-defined. Multi-class imbalanced data poses further challenges, as the relationship between classes is much more complex and simple decomposition into a number of binary problems leads to a significant loss of information. In this paper, we propose the first compound framework for dealing with multi-class big data problems, addressing at the same time the existence of multiple classes and high volumes of data. We propose to analyze the instance-level difficulties in each class, leading to an understanding of what causes learning difficulties. We embed this information in popular resampling algorithms, which allows for informative balancing of multiple classes. We propose an efficient implementation of the discussed algorithm on Apache Spark, including a novel version of SMOTE that overcomes the spatial limitations of its predecessor in distributed environments. An extensive experimental study shows that using instance-level information significantly improves learning from multi-class imbalanced big data. Our framework can be downloaded from https://github.com/fsleeman/minority-type-imbalanced. © 2020 Elsevier B.V. All rights reserved.
Pages: 15