Fraud Detection Using Large-scale Imbalance Dataset

Cited by: 6
Authors
Rubaidi, Zainab Saad [1 ,2 ]
Ben Ammar, Boulbaba [2 ]
Ben Aouicha, Mohamed [2 ]
Affiliations
[1] Al Muthanna Univ, Coll Agr, Samawah, Iraq
[2] Univ Sfax, Fac Sci, Data Engn & Semant Res Unit, Sfax, Tunisia
Keywords
Fraud detection; classification; machine learning; oversampling; undersampling; FEATURE-SELECTION; SAMPLING APPROACH; CLASSIFICATION; PREDICTION; ALGORITHM; SYSTEMS; SMOTE;
DOI
10.1142/S0218213022500373
CLC Number
TP18 [Theory of Artificial Intelligence]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
In machine learning, an imbalanced classification problem refers to a dataset in which the classes are not evenly distributed; it arises whenever the labels to be predicted are far from uniformly represented. Resampling methods, which add samples to the minority class or remove samples from the majority class, are widely regarded as an effective remedy. This study proposes a framework for handling imbalanced datasets in fraud detection. Undersampling (Random and NearMiss) and oversampling (Random, SMOTE, and Borderline-SMOTE) were used as the resampling techniques for balancing the evaluated dataset. For the first time, a large-scale imbalanced dataset collected from the Kaggle website, concerning fraud in electricity and gas consumption at the Tunisian electricity and gas company, was used to test both families of methods. The balanced data were then evaluated with four machine learning classifiers: Logistic Regression (LR), Naive Bayes (NB), Random Forest (RF), and XGBoost. Standard evaluation metrics, namely precision, recall, F1-score, and accuracy, were used to assess the results. The experiments show that the RF model provided the best performance, outperforming all other classifiers with a classification accuracy of 89% using NearMiss undersampling and 99% using Random oversampling.
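To make the resample-then-classify workflow concrete, the sketch below applies the same five resampling techniques and a Random Forest classifier to a synthetic imbalanced dataset, using the imbalanced-learn and scikit-learn Python libraries. This is a minimal illustration, not the authors' implementation: the synthetic data, the hyperparameters, and the train/test split are assumptions made purely for demonstration.

```python
# Minimal sketch of the resample-then-classify pipeline described in the
# abstract. The synthetic dataset and default hyperparameters are placeholders,
# not the study's actual data or settings.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from imblearn.under_sampling import RandomUnderSampler, NearMiss
from imblearn.over_sampling import RandomOverSampler, SMOTE, BorderlineSMOTE

# Synthetic stand-in for a large, highly imbalanced fraud dataset (~5% positives).
X, y = make_classification(n_samples=20000, n_features=20,
                           weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

samplers = {
    "Random undersampling": RandomUnderSampler(random_state=42),
    "NearMiss": NearMiss(version=1),
    "Random oversampling": RandomOverSampler(random_state=42),
    "SMOTE": SMOTE(random_state=42),
    "Borderline-SMOTE": BorderlineSMOTE(random_state=42),
}

for name, sampler in samplers.items():
    # Resample only the training split, then fit the classifier on balanced data
    # and score it on the untouched test split.
    X_bal, y_bal = sampler.fit_resample(X_train, y_train)
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf.fit(X_bal, y_bal)
    print(f"--- {name} ---")
    print(classification_report(y_test, clf.predict(X_test), digits=3))
```

Resampling only the training split, as above, keeps the test set representative of the original class distribution, so precision, recall, and F1-score reflect performance on realistically imbalanced data.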
Pages: 23