Comparing SMOTE Family Techniques in Predicting Insurance Premium Defaulting using Machine Learning Models

Cited by: 0
Authors
Kotb, Mohamed Hanafy [1,2]
Ming, Ruixing [1 ]
Affiliations
[1] Zhejiang Gongshang Univ, Sch Stat & Math, Hangzhou 310018, Peoples R China
[2] Assiut Univ, Fac Commerce, Dept Stat Math & Insurance, Asyut 71515, Egypt
Keywords
Machine learning; classification; insurance; imbalanced data; SMOTE family; statistical analysis
DOI
10.14569/IJACSA.2021.0120970
CLC number
TP301 [Theory and Methods]
Subject classification code
081202
Abstract
Default in premium payments significantly impacts the profitability of insurance companies, so predicting defaults in advance is very important for them. Thanks to technological advancements, prediction in the insurance sector is one of the most beneficial and important study areas today. However, because of the imbalanced datasets in this industry, predicting insurance premium defaulting is a difficult task. Moreover, no prior study applies and compares different SMOTE family approaches to address the issue of imbalanced data. Therefore, this study compares different SMOTE family approaches, namely the Synthetic Minority Oversampling Technique (SMOTE), Safe-Level SMOTE (SLS), Relocating Safe-Level SMOTE (RSLS), Density-Based SMOTE (DBSMOTE), Borderline-SMOTE (BLSMOTE), Adaptive Synthetic Sampling (ADASYN), Adaptive Neighbor Synthetic (ASN), SMOTE-Tomek, and SMOTE-ENN, to solve the problem of imbalanced data. This study applied a variety of machine learning (ML) classifiers to assess the performance of the SMOTE family in addressing the imbalanced problem. These classifiers include Logistic Regression (LR), CART, C4.5, C5.0, Support Vector Machine (SVM), Random Forest (RF), Bagged CART (BC), AdaBoost (ADA), Stochastic Gradient Boosting (SGB), XGBoost (XGB), Naive Bayes (NB), k-Nearest Neighbors (k-NN), and Neural Networks (NN). For model validation, the random hold-out strategy is used. The findings, obtained using various assessment measures, show that ML algorithms do not perform well on imbalanced data, indicating that the problem of imbalanced data must be addressed. In contrast, balanced datasets created by SMOTE family techniques improve the performance of the classifiers. Furthermore, the Friedman test, a statistical significance test, confirms that the hybrid SMOTE family methods outperform the others; in particular, SMOTE-Tomek performs better than the other resampling approaches. Among the ML algorithms, the SVM model produced the best results with SMOTE-Tomek.
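To illustrate the resampling-then-classify pipeline the abstract describes, the following Python sketch combines SMOTE-Tomek with an SVM under a random hold-out split. This is not the authors' code: it uses the imbalanced-learn and scikit-learn libraries, and the synthetic dataset, class ratio, and hyperparameters are assumptions for demonstration only.

```python
# A minimal sketch (not the authors' implementation): balance an
# imbalanced binary dataset with SMOTE-Tomek, then train an SVM and
# evaluate it on a random hold-out set.
from imblearn.combine import SMOTETomek
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for a premium-default dataset (~6% minority class).
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.94, 0.06], random_state=42)

# Random hold-out split: 70% train / 30% test, stratified by class.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Resample only the training set: SMOTE oversamples the minority class,
# then Tomek-link cleaning removes ambiguous borderline examples.
X_res, y_res = SMOTETomek(random_state=42).fit_resample(X_train, y_train)

# Fit the SVM on the balanced data; assess it on the untouched test set.
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_res, y_res)
print(classification_report(y_test, clf.predict(X_test)))
```

Note that resampling is applied to the training split only, so the hold-out evaluation still reflects the original class distribution.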
Pages: 621-629
Number of pages: 9