Data reduction techniques for highly imbalanced Medicare Big Data

Cited by: 13
Authors
Hancock, John T. [1]
Wang, Huanjing [2]
Khoshgoftaar, Taghi M. [1]
Liang, Qianxin [1]
Affiliations
[1] Florida Atlantic Univ, Coll Engn & Comp Sci, Boca Raton, FL 33431 USA
[2] Western Kentucky Univ, Ogden Coll Sci & Engn, Bowling Green, KY USA
Keywords
Random undersampling; Ensemble supervised feature selection; Big Data; Medicare fraud detection
DOI
10.1186/s40537-023-00869-3
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
In the domain of Medicare insurance fraud detection, handling imbalanced Big Data and high dimensionality remains a significant challenge. This study assesses the combined efficacy of two data reduction techniques: Random Undersampling (RUS) and a novel ensemble supervised feature selection method. The techniques are applied to optimize Machine Learning models for fraud identification in the classification of highly imbalanced Big Medicare Data. Utilizing two datasets from the Centers for Medicare & Medicaid Services (CMS) labeled by the List of Excluded Individuals/Entities (LEIE), our principal contribution lies in empirically demonstrating that data reduction techniques applied to these datasets significantly improve classification performance. The study employs a systematic experimental design to investigate various scenarios, ranging from using each technique in isolation to employing them in combination. The results indicate that a synergistic application of both techniques outperforms models that utilize all available features and data. Moreover, reducing the number of features leads to more explainable models. Given the enormous financial implications of Medicare fraud, our findings not only offer computational advantages but also significantly enhance the effectiveness of fraud detection systems, thereby having the potential to improve healthcare services.
Pages: 41
Related papers
34 in total
[1]   A Novel Method for Fraudulent Medicare Claims Detection from Expected Payment Deviations [J].
Bauder, Richard A. ;
Khoshgoftaar, Taghi M. .
PROCEEDINGS OF 2016 IEEE 17TH INTERNATIONAL CONFERENCE ON INFORMATION REUSE AND INTEGRATION (IEEE IRI), 2016, :11-19
[2]  
Bekkar Mohamed, 2013, J Inf Eng Appl, V3, P27
[3]  
Boyd Kendrick, 2013, Machine Learning and Knowledge Discovery in Databases. European Conference, ECML PKDD 2013. Proceedings: LNCS 8190, P451, DOI 10.1007/978-3-642-40994-3_29
[4]   Random forests [J].
Breiman, L .
MACHINE LEARNING, 2001, 45 (01) :5-32
[5]   Bagging predictors [J].
Breiman, L .
MACHINE LEARNING, 1996, 24 (02) :123-140
[6]  
Breiman L., 2017, CLASSIFICATION REGRE, DOI 10.1201/9781315139470
[7]  
Centers for Medicare and Medicaid Services, 2019, Fiscal year 2012 improper payment rates for cms programs
[8]   SMOTE: Synthetic minority over-sampling technique [J].
Chawla, Nitesh V. ;
Bowyer, Kevin W. ;
Hall, Lawrence O. ;
Kegelmeyer, W. Philip .
JOURNAL OF ARTIFICIAL INTELLIGENCE RESEARCH, 2002, 16 :321-357
[9]   XGBoost: A Scalable Tree Boosting System [J].
Chen, Tianqi ;
Guestrin, Carlos .
KDD'16: PROCEEDINGS OF THE 22ND ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, 2016, :785-794
[10]  
Civil Division U.S. Department of Justice, 2020, Fraud Statistics, Overview