Improving Credit Card Fraud Detection with Data Reduction Approaches

Cited by: 1
Authors
Wang, Huanjing [1 ]
Hancock, John [2 ]
Khoshgoftaar, Taghi M. [2 ]
Affiliations
[1] Western Kentucky Univ, Sch Engn & Appl Sci, 1906 Coll Hts Blvd, Bowling Green, KY 42101 USA
[2] Florida Atlantic Univ, Dept Elect Engn & Comp Sci, 777 Glades Rd, Boca Raton, FL 33431 USA
Keywords
Ensemble supervised feature selection; random undersampling; data reduction; credit card fraud; class imbalance
DOI
10.1142/S0218539324400011
Chinese Library Classification Code
T [Industrial Technology]
Discipline Classification Code
08
Abstract
Detecting fraudulent activities in credit card transactions can be challenging due to issues like high dimensionality and class imbalance that are often present in the datasets. To address these challenges, data reduction techniques such as data sampling and feature selection have become essential. In this study, we compare four approaches for data reduction: using data sampling alone, employing feature selection alone, applying data sampling followed by feature selection, and using feature selection followed by data sampling. Additionally, we include results using all features. We build classification models using five Decision Tree-based classifiers and Logistic Regression, and evaluate their performance using two metrics: the Area Under the Receiver Operating Characteristic Curve (AUC) and the Area Under the Precision-Recall Curve (AUPRC). In this work, we adopt ensemble supervised feature selection (SFS) techniques and Random Undersampling (RUS) for data reduction. The experimental results demonstrate that all four data reduction techniques have the potential to improve classifier performance. These results are valuable because the choice of classifier often depends on the application domain, computing environment, and licensing agreements, whereas these data reduction techniques can be applied independently of such constraints. We recommend ensemble SFS followed by RUS (SFS-RUS) as the preferred data reduction method, due to its ability to run feature selection and data sampling in parallel. Additionally, we find that XGBoost and CatBoost outperform the other classifiers.
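As a rough illustration of the SFS-RUS ordering the abstract recommends, the sketch below selects features first and then balances the classes by random undersampling. The two filter scorers (class-mean gap and class-median gap) are simplified, hypothetical stand-ins for the paper's ensemble of supervised feature selection techniques, and all function names here are invented for illustration.

```python
import random
from collections import Counter

def ensemble_rank_features(X, y, k):
    """Combine two filter scorers by summed rank; keep the top-k features.
    Both scorers are simplified stand-ins for the paper's ensemble SFS."""
    n_feat = len(X[0])
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]

    def mean(j, rows):
        return sum(r[j] for r in rows) / len(rows)

    def median(j, rows):
        vals = sorted(r[j] for r in rows)
        m = len(vals) // 2
        return vals[m] if len(vals) % 2 else (vals[m - 1] + vals[m]) / 2

    # Scorer 1: gap between class means; Scorer 2: gap between class medians.
    score_a = [abs(mean(j, pos) - mean(j, neg)) for j in range(n_feat)]
    score_b = [abs(median(j, pos) - median(j, neg)) for j in range(n_feat)]

    def ranks(scores):
        order = sorted(range(n_feat), key=lambda j: -scores[j])
        r = [0] * n_feat
        for place, j in enumerate(order):
            r[j] = place
        return r

    ra, rb = ranks(score_a), ranks(score_b)
    best = sorted(range(n_feat), key=lambda j: ra[j] + rb[j])[:k]
    return sorted(best)

def random_undersample(X, y, seed=0):
    """RUS: randomly discard majority-class rows until the classes balance."""
    rng = random.Random(seed)
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    keep_n = counts[minority]
    kept = []
    for c in counts:
        idx = [i for i, label in enumerate(y) if label == c]
        kept += idx if c == minority else rng.sample(idx, keep_n)
    kept.sort()
    return [X[i] for i in kept], [y[i] for i in kept]

def sfs_rus(X, y, k, seed=0):
    """SFS-RUS ordering: feature selection first, then undersampling."""
    cols = ensemble_rank_features(X, y, k)
    X_reduced = [[row[j] for j in cols] for row in X]
    X_bal, y_bal = random_undersample(X_reduced, y, seed)
    return X_bal, y_bal, cols
```

Note that feature scoring reads only the columns and RUS drops only rows, so in principle the two steps can be computed independently and combined, which is consistent with the parallelism the abstract cites as the reason for preferring SFS-RUS.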
Pages: 27