Severely imbalanced Big Data challenges: investigating data sampling approaches

Cited by: 63
Authors
Hasanin, Tawfiq [1 ]
Khoshgoftaar, Taghi M. [1 ]
Leevy, Joffrey L. [1 ]
Bauder, Richard A. [1 ]
Affiliations
[1] Florida Atlantic Univ, 777 Glades Rd, Boca Raton, FL 33431 USA
Funding
National Science Foundation (USA);
Keywords
Big Data; Class imbalance; Machine Learning; Medicare fraud; Oversampling; SlowlorisBig; Undersampling; CLASSIFICATION; MAPREDUCE; SMOTE;
DOI
10.1186/s40537-019-0274-4
Chinese Library Classification (CLC)
TP301 [Theory and Methods];
Discipline code
081202;
Abstract
Severe class imbalance between the majority and minority classes in Big Data can bias the predictive performance of Machine Learning algorithms toward the majority (negative) class. When the minority (positive) class holds greater value than the majority class and false negatives incur a greater penalty than false positives, this bias may lead to adverse consequences. Our paper incorporates two case studies, each utilizing three learners, six sampling approaches, two performance metrics, and five sampled distribution ratios, to uniquely investigate the effect of severe class imbalance on Big Data analytics. The learners (Gradient-Boosted Trees, Logistic Regression, Random Forest) were implemented within the Apache Spark framework. The first case study is based on a Medicare fraud detection dataset. The second case study, unlike the first, includes training data from one source (SlowlorisBig dataset) and test data from a separate source (POST dataset). Results from the Medicare case study are not conclusive regarding the best sampling approach when measured with the Area Under the Receiver Operating Characteristic Curve (AUC) and Geometric Mean performance metrics, although Random Undersampling performs adequately. For the SlowlorisBig case study, Random Undersampling convincingly outperforms the other five sampling approaches (Random Oversampling, Synthetic Minority Over-sampling TEchnique, SMOTE-borderline1, SMOTE-borderline2, ADAptive SYNthetic) on both the AUC and Geometric Mean metrics. Based on its classification performance in both case studies, Random Undersampling is the best choice, as it yields models trained on a significantly smaller number of samples, thus reducing the computational burden and training time.
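The abstract names Random Undersampling, applied at chosen class distribution ratios, as the recommended approach within Apache Spark. The sketch below (not the authors' code) illustrates how a severely imbalanced Spark DataFrame might be undersampled to a target ratio and fed to the study's three learners; the column names "label"/"features", the file paths, and the 50:50 target ratio are illustrative assumptions.

```python
# Minimal sketch: Random Undersampling of the majority class in PySpark,
# followed by the three learners used in the paper. Assumes a DataFrame
# with a binary "label" column (1.0 = minority/positive) and a "features"
# vector column; all names and paths here are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.classification import (GBTClassifier, LogisticRegression,
                                       RandomForestClassifier)

spark = SparkSession.builder.appName("rus-sketch").getOrCreate()
df = spark.read.parquet("train.parquet")  # hypothetical training data

counts = {row["label"]: row["count"]
          for row in df.groupBy("label").count().collect()}
n_pos, n_neg = counts[1.0], counts[0.0]

# Keep all minority samples; downsample the majority so positives make up
# the target fraction of the training set (assumed 50% here). sampleBy
# draws an approximate stratified sample per class label.
target_pos_fraction = 0.50
keep_neg = n_pos * (1 - target_pos_fraction) / target_pos_fraction
fractions = {1.0: 1.0, 0.0: min(1.0, keep_neg / n_neg)}
balanced = df.sampleBy("label", fractions=fractions, seed=42)

# The three learners from the study, with default hyperparameters.
learners = [
    GBTClassifier(labelCol="label", featuresCol="features"),
    LogisticRegression(labelCol="label", featuresCol="features"),
    RandomForestClassifier(labelCol="label", featuresCol="features"),
]
models = [learner.fit(balanced) for learner in learners]
```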
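The paper evaluates models with two metrics, AUC and Geometric Mean. As a hedged illustration continuing the assumed setup above, AUC is available directly from Spark's evaluator, while G-mean, the square root of the product of the true positive rate and true negative rate, can be derived from confusion-matrix counts:

```python
# Minimal sketch: scoring a fitted model on a held-out set with AUC and
# G-mean = sqrt(TPR * TNR). The test path and column names are assumptions.
import math
from pyspark.ml.evaluation import BinaryClassificationEvaluator

test = spark.read.parquet("test.parquet")  # hypothetical held-out data
predictions = models[0].transform(test)

auc = BinaryClassificationEvaluator(
    labelCol="label", metricName="areaUnderROC").evaluate(predictions)

tp = predictions.filter("label = 1.0 AND prediction = 1.0").count()
fn = predictions.filter("label = 1.0 AND prediction = 0.0").count()
tn = predictions.filter("label = 0.0 AND prediction = 0.0").count()
fp = predictions.filter("label = 0.0 AND prediction = 1.0").count()

tpr = tp / (tp + fn)  # true positive rate (minority-class recall)
tnr = tn / (tn + fp)  # true negative rate
g_mean = math.sqrt(tpr * tnr)
print(f"AUC = {auc:.4f}, G-mean = {g_mean:.4f}")
```

Unlike accuracy, both metrics are insensitive to the class distribution, which is why they suit severely imbalanced data.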
Pages: 25