Fraud Detection Using Large-scale Imbalance Dataset

Cited by: 2
Authors
Rubaidi, Zainab Saad [1, 2]
Ben Ammar, Boulbaba [2]
Ben Aouicha, Mohamed [2]
Affiliations
[1] Al Muthanna Univ, Coll Agr, Samawah, Iraq
[2] Univ Sfax, Fac Sci, Data Engn & Semant Res Unit, Sfax, Tunisia
Keywords
Fraud detection; classification; machine learning; oversampling; undersampling; FEATURE-SELECTION; SAMPLING APPROACH; CLASSIFICATION; PREDICTION; ALGORITHM; SYSTEMS; SMOTE;
DOI
10.1142/S0218213022500373
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In machine learning, an imbalanced classification problem refers to a dataset in which the classes are not evenly distributed. It commonly arises when classifying data whose label distribution is not uniform. Resampling, which either adds samples to the minority class or drops samples from the majority class, can be considered the best solution to this problem. This study proposes a framework for handling any imbalanced dataset in fraud detection. To balance the evaluated dataset, our experiments used undersampling (Random and NearMiss) and oversampling (Random, SMOTE, and Borderline SMOTE) as resampling techniques. For the first time, a large-scale imbalanced dataset collected from the Kaggle website was used to test both approaches for detecting fraud in electricity and gas consumption at the Tunisian electricity and gas company. The resampled data were evaluated with four machine learning classifiers: Logistic Regression (LR), Naive Bayes (NB), Random Forest (RF), and XGBoost. Standard evaluation metrics (precision, recall, F1-score, and accuracy) were used to assess the findings. The experimental results show that the RF model performed best and outperformed all other classifiers compared, attaining a classification accuracy of 89% with NearMiss undersampling and 99% with Random oversampling.
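The two random resampling strategies and the four metrics named in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names (`random_oversample`, `random_undersample`, `binary_metrics`) and the toy data are hypothetical, and the sketch uses only the Python standard library. NearMiss, SMOTE, and Borderline SMOTE are not reproduced here; the imbalanced-learn package provides implementations of them.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate random rows of each smaller class until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        rows = [x for x, lab in zip(X, y) if lab == label]
        for x in rng.sample(rows, target):
            X_out.append(x)
            y_out.append(label)
    return X_out, y_out


def binary_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall, F1-score, and accuracy for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Made-up toy data: 95 legitimate records (label 0) and 5 fraudulent ones (label 1).
X = [[i] for i in range(100)]
y = [0] * 95 + [1] * 5

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
print(Counter(y_over))
print(Counter(y_under))

# A trivial model that never flags fraud scores high accuracy but zero recall.
always_legit = [0] * len(y)
print(binary_metrics(y, always_legit))
```

The last call shows why accuracy alone can mislead on imbalanced data and why precision, recall, and F1-score are reported alongside it. In practice, resampling is normally applied to the training split only, so that the test set keeps the original class distribution.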
Pages: 23