Fraud Detection Using Large-scale Imbalance Dataset

Cited by: 6
Authors
Rubaidi, Zainab Saad [1, 2]
Ben Ammar, Boulbaba [2]
Ben Aouicha, Mohamed [2]
Affiliations
[1] Al Muthanna Univ, Coll Agr, Samawah, Iraq
[2] Univ Sfax, Fac Sci, Data Engn & Semant Res Unit, Sfax, Tunisia
Keywords
Fraud detection; classification; machine learning; oversampling; undersampling; FEATURE-SELECTION; SAMPLING APPROACH; CLASSIFICATION; PREDICTION; ALGORITHM; SYSTEMS; SMOTE;
DOI
10.1142/S0218213022500373
CLC number
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
In machine learning, an imbalanced classification problem refers to a dataset in which the classes are not evenly distributed. This problem commonly arises when classifying data whose label distribution is not uniform. Resampling, which either adds samples to the minority class or drops samples from the majority class, can be considered the best solution to this problem. This study proposes a framework for handling any imbalanced dataset for fraud detection. To balance the evaluated dataset, our experiments used undersampling (Random and NearMiss) and oversampling (Random, SMOTE, and BorderLine SMOTE) as resampling techniques. For the first time, a large-scale imbalanced dataset collected from the Kaggle website was used to test both approaches for detecting fraud in electricity and gas consumption at the Tunisian electricity and gas company. The resampled data were evaluated with four machine learning classifiers: Logistic Regression (LR), Naive Bayes (NB), Random Forest (RF), and XGBoost. Standard evaluation metrics (precision, recall, F1-score, and accuracy) were used to assess the findings. The experimental results showed that the RF model performed best, outperforming all other classifiers compared, with a classification accuracy of 89% using NearMiss undersampling and 99% using Random oversampling.
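The two simplest resampling strategies named in the abstract, random oversampling and random undersampling, can be illustrated with a minimal standard-library sketch. This is not the authors' implementation: the function names, the seed, and the 95/5 toy dataset are hypothetical, and NearMiss, SMOTE, and BorderLine SMOTE are not covered here. The sketch also computes the four metrics listed in the abstract for a trivial predictor, to show why accuracy alone can mislead on imbalanced data.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        rows = [x for x, lab in zip(X, y) if lab == label]
        for x in rng.sample(rows, target):
            X_out.append(x)
            y_out.append(label)
    return X_out, y_out


def binary_metrics(y_true, y_pred, positive=1):
    """Precision, recall, F1-score, and accuracy for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Hypothetical toy data: 95 legitimate rows (label 0), 5 fraudulent rows (label 1).
X = [[float(i)] for i in range(100)]
y = [0] * 95 + [1] * 5

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)

# A predictor that always answers "legitimate" scores high accuracy
# on the imbalanced data while detecting no fraud at all.
baseline = binary_metrics(y, [0] * len(y))
```

On this toy data, oversampling yields 95 rows per class and undersampling yields 5 rows per class, while the always-legitimate baseline reaches high accuracy with zero recall on the fraud class. In practice, resampling would normally be applied to the training split only, so that evaluation reflects the original class distribution.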
Pages: 23