Imbalanced data issues in machine learning classifiers: a case study

Cited by: 1
Author
Gong, Mingxing [1]
Affiliation
[1] Univ Delaware, Alfred Lerner Coll Business, Inst Financial Serv Analyt, Purnell Hall, Newark, DE 19716 USA
Source
JOURNAL OF OPERATIONAL RISK | 2022 / Volume 17 / Issue 04
Keywords
machine learning; imbalanced data; fraud risk; performance measures; cost-sensitive learning; classification
DOI
10.21314/JOP.2022.027
Chinese Library Classification number
F8 [Public Finance, Finance]
Discipline classification code
0202
Abstract
Machine learning classifiers are widely used in financial applications. Due to the nature of certain classification problems, special care should be taken when dealing with imbalanced data. In practice, many model developers and validators fail to take this into account in their model development and validation. In addition, resampling is a common technique to address imbalanced data issues when building traditional logistic regression models. However, there has been no specific discussion regarding the resampling ratio used to rebalance the data or how the issue of imbalance impacts different kinds of machine learning classifiers, especially the more advanced ones. This paper aims to outline the special characteristics of the classifiers, compare different methods in dealing with imbalanced data issues and provide best practice in model development, evaluation and validation to avoid common pitfalls. Although the methods discussed in this paper can apply to general machine learning classifiers in applications with imbalanced data issues, by using a case study in credit card fraud detection this paper calls practitioners' attention to the imbalanced data problems therein, where class imbalance is often mistreated and lacks theoretical discussion.
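The two concerns the abstract raises, misleading performance measures and the choice of resampling ratio, can be illustrated with a minimal sketch. This is not the paper's method or data: the toy labels, the `undersample` and `accuracy_and_recall` helpers and the 4:1 ratio are all hypothetical, chosen only to show the mechanics.

```python
import random

def undersample(labels, ratio, seed=0):
    """Keep every minority (1) row and a random subset of majority (0) rows
    so that majority:minority is at most `ratio`:1. Returns row indices."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    keep = min(len(majority), int(ratio * len(minority)))
    return sorted(minority + rng.sample(majority, keep))

def accuracy_and_recall(y_true, y_pred):
    """Overall accuracy and recall on the positive (fraud) class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    positives = sum(y_true)
    recall = true_pos / positives if positives else 0.0
    return correct / len(y_true), recall

# Hypothetical toy data: 10 fraud cases among 1,000 transactions.
labels = [1] * 10 + [0] * 990

# A "classifier" that never flags fraud looks accurate but catches nothing.
always_legit = [0] * len(labels)
acc, rec = accuracy_and_recall(labels, always_legit)

# Rebalance to an assumed 4:1 majority-to-minority ratio.
idx = undersample(labels, ratio=4)
resampled = [labels[i] for i in idx]
```

On this toy set the trivial classifier scores high accuracy with zero recall, which is why accuracy alone is a poor measure under class imbalance. The `ratio` argument is the quantity the abstract says is rarely discussed; any particular value here is an assumption, not a recommendation.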
Pages: 17-36
Page count: 20
Related papers
50 in total
  • [1] Machine-learning classifiers for imbalanced tornado data
    Trafalis T.B.
    Adrianto I.
    Richman M.B.
    Lakshmivarahan S.
    Computational Management Science, 2014, 11 (4) : 403 - 418
  • [2] A machine learning case study to predict rare clinical event of interest: imbalanced data, interpretability, and practical considerations
    Zhong, Sheng
    Zhang, Jane
    Jiao, Jenny
    Zhu, Hongjian
    Xing, Yunzhao
    Wang, Li
    JOURNAL OF BIOPHARMACEUTICAL STATISTICS, 2024,
  • [3] Active Learning with Abstaining Classifiers for Imbalanced Drifting Data Streams
    Korycki, Lukasz
    Cano, Alberto
    Krawczyk, Bartosz
    2019 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2019, : 2334 - 2343
  • [4] Fuzzy prototype selection-based classifiers for imbalanced data. Case study
    Rodriguez Alvarez, Yanela
    Garcia Lorenzo, Maria Matilde
    Caballero Mota, Yaile
    Filiberto Cabrera, Yaima
    Garcia Hilarion, Isabel M.
    Montes de Oca, Daniela Machado
    Bello Perez, Rafael
    PATTERN RECOGNITION LETTERS, 2022, 163 : 183 - 190
  • [5] On Machine Learning with Imbalanced Data and Research Quality Evaluation Methodologies
    Lipitakis, Anastasia-Dimitra
    Lipitakis, Evangelia A. E. C.
    2014 INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE AND COMPUTATIONAL INTELLIGENCE (CSCI), VOL 1, 2014, : 451 - 457
  • [6] Machine Learning for Prediction of Imbalanced Data: Credit Fraud Detection
    Thanh Cong Tran
    Tran Khanh Dang
    PROCEEDINGS OF THE 2021 15TH INTERNATIONAL CONFERENCE ON UBIQUITOUS INFORMATION MANAGEMENT AND COMMUNICATION (IMCOM 2021), 2021,
  • [7] Imbalanced Data Problem in Machine Learning: A Review
    Altalhan, Manahel
    Algarni, Abdulmohsen
    Alouane, Monia Turki-Hadj
    IEEE ACCESS, 2025, 13 : 13686 - 13699
  • [8] Imbalanced data preprocessing techniques for machine learning: a systematic mapping study
    de Vargas, Vitor Werner
    Schneider Aranda, Jorge Arthur
    Costa, Ricardo dos Santos
    da Silva Pereira, Paulo Ricardo
    Victoria Barbosa, Jorge Luis
    KNOWLEDGE AND INFORMATION SYSTEMS, 2023, 65 (01) : 31 - 57
  • [9] Resampling imbalanced data to detect fake reviews using machine learning classifiers and textual-based features
    Gregorius Satia Budhi
    Raymond Chiong
    Zuli Wang
    Multimedia Tools and Applications, 2021, 80 : 13079 - 13097