An Empirical Study on Class Rarity in Big Data

Cited by: 32
Authors
Bauder, Richard A. [1]
Khoshgoftaar, Taghi M. [1]
Hasanin, Tawfiq [1]
Affiliations
[1] Florida Atlantic Univ, Boca Raton, FL 33431 USA
Source
2018 17TH IEEE INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA) | 2018
Keywords
Class Rarity; Severe Class Imbalance; Medicare; Fraud Detection; Big Data; HEALTH-CARE; FRAUD; CLASSIFICATION
DOI
10.1109/ICMLA.2018.00125
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
The problem of class imbalance, especially the classification of rare cases, is an important area in machine learning. These rare cases are typically the ones of interest, thus accurate classification of these instances is required. Class imbalance is a well-studied area with relatively small datasets, but there is limited research focusing on both rarity and class imbalance with Big Data. In this study, we focus on the impact of rare class classification in the area of fraud detection using publicly available real-world Big Data from Medicare data sources. We demonstrate that rarity significantly degrades fraud detection performance over three machine learning models and nine datasets, with varying numbers of positive class instances. From these experiments, we show clear groupings indicating different levels of class imbalance and rarity. Furthermore, our results, showing decreasing performance with increasing rarity, are corroborated using three additional Medicare Big Data sources.
Pages: 785-790
Page count: 6