Imbalanced data issues in machine learning classifiers: a case study

Cited by: 1
Authors
Gong, Mingxing [1]
Affiliation
[1] Univ Delaware, Alfred Lerner Coll Business, Inst Financial Serv Analyt, Purnell Hall, Newark, DE 19716 USA
Source
JOURNAL OF OPERATIONAL RISK | 2022 / Volume 17 / Issue 04
Keywords
machine learning; imbalanced data; fraud risk; performance measures; cost sensitive learning; CLASSIFICATION;
DOI
10.21314/JOP.2022.027
Chinese Library Classification (CLC) number
F8 [Public Finance, Finance];
Discipline classification code
0202;
Abstract
Machine learning classifiers are widely used in financial applications. Due to the nature of certain classification problems, special care should be taken when dealing with imbalanced data. In practice, many model developers and validators fail to take this into account in their model development and validation. In addition, resampling is a common technique to address imbalanced data issues when building traditional logistic regression models. However, there has been no specific discussion regarding the resampling ratio used to rebalance the data or how the issue of imbalance impacts different kinds of machine learning classifiers, especially the more advanced ones. This paper aims to outline the special characteristics of the classifiers, compare different methods in dealing with imbalanced data issues and provide best practice in model development, evaluation and validation to avoid common pitfalls. Although the methods discussed in this paper can apply to general machine learning classifiers in applications with imbalanced data issues, by using a case study in credit card fraud detection this paper calls practitioners' attention to the imbalanced data problems therein, where class imbalance is often mistreated and lacks theoretical discussion.
Pages: 17-36
Page count: 20