An Effective Data Sampling Procedure for Imbalanced Data Learning on Health Insurance Fraud Detection

Cited by: 0
Authors
Kotekani S.S. [1, 2]
Velchamy I. [1, 2]
Affiliations
[1] Department of Computer Applications, CMRIT, Bengaluru
[2] VTU, Belgavi
Keywords
Class imbalance; Classification algorithms; Fraud detection; Health insurance; K-means; SMOTE
DOI
10.20532/cit.2020.1005216
Abstract
[...] from many academic researchers and industries worldwide due to its increasing popularity. Insurance datasets are enormous, with skewed distributions and high dimensionality. The skewed class distribution and the data volume are considered significant problems when analyzing insurance datasets, as these issues increase misclassification rates. Although sampling approaches such as random oversampling and SMOTE can help balance the data, they can also increase computational complexity and degrade the model's performance, so more sophisticated techniques are needed to balance the skewed classes efficiently. This research focuses on optimizing the learner for fraud detection by applying a Fused Resampling and Cleaning Ensemble (FusedRCE) for effective sampling in health insurance fraud detection. We hypothesized that meticulous oversampling followed by guided data cleaning would improve the prediction performance and the learner's understanding of the minority fraudulent classes compared to other sampling techniques. The proposed model works in three steps. In the first step, PCA is applied to extract the necessary features and reduce the dimensionality of the data. In the second step, a hybrid combination of k-means clustering and SMOTE oversampling is used to resample the imbalanced data. Because oversampling introduces considerable noise, the third step cleans the balanced data with the Tomek Link algorithm to remove the noisy samples generated during oversampling. The Tomek Link algorithm clears the boundary between minority and majority class samples, leaving the data cleaner and less noisy. The resulting dataset is used to train four classification algorithms, Logistic Regression, Decision Tree Classifier, k-Nearest Neighbors, and Neural Networks, evaluated using repeated 5-fold cross-validation. Compared to the other classifiers, Neural Networks with FusedRCE had the highest average prediction rate, 98.9%. The results were also measured using F1 score, precision, recall, and AUC. The results show that the proposed method performed significantly better than any other fraud detection approach in health insurance, predicting more fraudulent data with greater accuracy and a 3x increase in training speed. © 2020, Journal of Computing and Information Technology. All Rights Reserved.
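
The oversample-then-clean idea in the abstract can be illustrated with a small sketch. This is not the authors' FusedRCE implementation: it omits the PCA and k-means steps, uses only the Python standard library, and all function names (`smote`, `tomek_links`, `remove_tomek_links`) and the toy data are hypothetical. It also assumes that both members of each Tomek link are dropped, which the abstract does not specify.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours (SMOTE-style)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: euclidean(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def nearest_index(i, points):
    """Index of the point closest to points[i], excluding i itself."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: euclidean(points[i], points[j]),
    )


def tomek_links(points, labels):
    """Pairs (i, j) that are mutual nearest neighbours with different labels."""
    nn = [nearest_index(i, points) for i in range(len(points))]
    return [
        (i, j)
        for i, j in enumerate(nn)
        if i < j and nn[j] == i and labels[i] != labels[j]
    ]


def remove_tomek_links(points, labels):
    """Drop both members of every Tomek link (an assumption of this sketch)."""
    drop = {idx for pair in tomek_links(points, labels) for idx in pair}
    keep = [i for i in range(len(points)) if i not in drop]
    return [points[i] for i in keep], [labels[i] for i in keep]


# Toy data (hypothetical): label 0 = legitimate claims, label 1 = fraud.
majority = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2), (0.2, 0.4), (1.0, 1.0)]
minority = [(1.1, 1.0), (1.5, 1.5), (1.4, 1.6)]

# Oversample the minority class to match the majority, then clean the boundary.
new_points = smote(minority, n_new=len(majority) - len(minority))
points = majority + minority + new_points
labels = [0] * len(majority) + [1] * (len(minority) + len(new_points))
clean_points, clean_labels = remove_tomek_links(points, labels)
print(len(points), "samples before cleaning,", len(clean_points), "after")
```

In practice, a library such as imbalanced-learn provides tested versions of these steps. The brute-force neighbour search here is quadratic in the number of samples and is meant only to make the logic visible.
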
Pages: 269-285
Page count: 16