An Empirical Study on Class Rarity in Big Data

Cited by: 32
Authors
Bauder, Richard A. [1]
Khoshgoftaar, Taghi M. [1]
Hasanin, Tawfiq [1]
Affiliations
[1] Florida Atlantic Univ, Boca Raton, FL 33431 USA
Source
2018 17TH IEEE INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA) | 2018
Keywords
Class Rarity; Severe Class Imbalance; Medicare; Fraud Detection; Big Data; HEALTH-CARE; FRAUD; CLASSIFICATION
DOI
10.1109/ICMLA.2018.00125
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The problem of class imbalance, especially the classification of rare cases, is an important area in machine learning. These rare cases are typically the ones of interest, thus accurate classification of these instances is required. Class imbalance is a well-studied area with relatively small datasets, but there is limited research focusing on both rarity and class imbalance with Big Data. In this study, we focus on the impact of rare class classification in the area of fraud detection using publicly available real-world Big Data from Medicare data sources. We demonstrate that rarity significantly degrades fraud detection performance over three machine learning models and nine datasets, with varying numbers of positive class instances. From these experiments, we show clear groupings indicating different levels of class imbalance and rarity. Furthermore, our results, showing decreasing performance with increasing rarity, are corroborated using three additional Medicare Big Data sources.
Pages: 785-790
Page count: 6
Related papers
37 in total
[1]   Using emerging patterns and decision trees in rare-class classification [J].
Alhammady, H ;
Ramamohanarao, K .
FOURTH IEEE INTERNATIONAL CONFERENCE ON DATA MINING, PROCEEDINGS, 2004, :315-318
[2]  
[Anonymous], MED PROV UT PAYM DAT
[3]  
[Anonymous], 2018, US Medicare Program
[4]  
[Anonymous], 2018, CTR MEDICARE MEDICAI
[5]  
[Anonymous], 2007, ICML, DOI 10.1145/1273496.1273614
[6]   The effects of varying class distribution on learner behavior for medicare fraud detection with imbalanced big data [J].
Bauder, Richard A. ;
Khoshgoftaar, Taghi M. .
HEALTH INFORMATION SCIENCE AND SYSTEMS, 2018, 6
[7]   A Survey of Medicare Data Processing and Integration for Fraud Detection [J].
Bauder, Richard A. ;
Khoshgoftaar, Taghi M. .
2018 IEEE INTERNATIONAL CONFERENCE ON INFORMATION REUSE AND INTEGRATION (IRI), 2018, :9-14
[8]   Using statistical text classification to identify health information technology incidents [J].
Chai, Kevin E. K. ;
Anthony, Stephen ;
Coiera, Enrico ;
Magrabi, Farah .
JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION, 2013, 20 (05) :980-985
[9]  
CMS, 2018, HCPCS GEN INF
[10]  
CMS, 2016, NAT PROV ID STAND NP