Calibrating Probability with Undersampling for Unbalanced Classification

Cited by: 252
Authors
Dal Pozzolo, Andrea [1]
Caelen, Olivier [2]
Johnson, Reid A. [3]
Bontempi, Gianluca [1, 4]
Affiliations
[1] Univ Libre Bruxelles, Machine Learning Grp, Dept Comp Sci, Brussels, Belgium
[2] Worldline SA, Fraud Risk Management Analyt, Brussels, Belgium
[3] Univ Notre Dame, iCeNSA, Dept Comp Sci & Engn, Notre Dame, IN 46556 USA
[4] Interuniv Inst Bioinformat Brussels IB2, Brussels, Belgium
DOI
10.1109/SSCI.2015.33
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Undersampling is a popular technique for reducing the skew in the class distribution of unbalanced datasets. However, it is well known that undersampling one class modifies the priors of the training set and consequently biases the posterior probabilities of a classifier [9]. In this paper, we study analytically and experimentally how undersampling affects the posterior probability of a machine learning model. We formalize the problem of undersampling and explore the relationship between the conditional probability in the presence and in the absence of undersampling. Although the bias due to undersampling does not affect the ranking order returned by the posterior probability, it significantly impacts classification accuracy and probability calibration. We use Bayes Minimum Risk theory to find the correct classification threshold and show how to adjust it after undersampling. Experiments on several real-world unbalanced datasets validate our results.
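The abstract does not spell out the correction itself, so the following is only a minimal sketch of what such an adjustment could look like, not the authors' code. It assumes that every positive example is kept and each negative example is retained independently with probability `beta` (a name chosen here for illustration). Under that assumption, Bayes' rule gives p_s = p / (p + beta * (1 - p)) for the posterior p_s on the undersampled data, which can be inverted to recover p. The function names are hypothetical.

```python
def correct_posterior(p_s: float, beta: float) -> float:
    """Map a posterior estimated on undersampled data back to the original class prior.

    Assumption (not stated in the abstract): all positives are kept and each
    negative is retained independently with probability beta, 0 < beta <= 1.
    """
    return beta * p_s / (beta * p_s - p_s + 1.0)


def undersampled_threshold(tau: float, beta: float) -> float:
    """Threshold to apply to the uncorrected posterior p_s so that the decision
    matches thresholding the corrected posterior at tau."""
    return tau / (tau + beta * (1.0 - tau))


if __name__ == "__main__":
    beta = 0.1  # hypothetical retention rate for the majority class
    for p_s in (0.2, 0.5, 0.9):
        # The mapping is monotone, so the ranking of examples is unchanged.
        print(p_s, correct_posterior(p_s, beta))
    print(undersampled_threshold(0.5, beta))
```

Because the mapping is strictly increasing in p_s, it leaves the ranking of examples untouched and changes only the calibration and the decision threshold, which is consistent with the claim in the abstract. Thresholding the corrected posterior at `tau` and thresholding the raw posterior at `undersampled_threshold(tau, beta)` yield the same decisions.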
Pages: 159 - 166
Page count: 8
Related papers
50 in total
  • [1] When is Undersampling Effective in Unbalanced Classification Tasks?
    Dal Pozzolo, Andrea
    Caelen, Olivier
    Bontempi, Gianluca
    MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES, ECML PKDD 2015, PT I, 2015, 9284 : 200 - 215
  • [2] Self-Organizing-Maps Based Undersampling for the Classification of Unbalanced Datasets
    Vannucci, Marco
    Colla, Valentina
    2018 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2018
  • [3] A clustering-based adaptive undersampling ensemble method for highly unbalanced data classification
    Yuan, Xiaohan
    Sun, Chuan
    Chen, Shuyu
    APPLIED SOFT COMPUTING, 2024, 159
  • [4] Unbalanced data weighted boundary point integration undersampling method
    He Y.
    Leng X.
    Wan J.
    Xi'an Dianzi Keji Daxue Xuebao/Journal of Xidian University, 2021, 48 (04) : 176 - 183 and 191
  • [5] A novel hybrid undersampling method for mining unbalanced datasets in banking and insurance
    Sundarkumar, G. Ganesh
    Ravi, Vadlamani
    ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2015, 37 : 368 - 377
  • [6] A Membership Probability–Based Undersampling Algorithm for Imbalanced Data
    Gilseung Ahn
    You-Jin Park
    Sun Hur
    Journal of Classification, 2021, 38 : 2 - 15
  • [7] Self-calibrating probability forecasting
    Vovk, V
    Shafer, G
    Nouretdinov, I
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 16, 2004, 16 : 1133 - 1140
  • [8] Calibrating random forests for probability estimation
    Dankowski, Theresa
    Ziegler, Andreas
    STATISTICS IN MEDICINE, 2016, 35 (22) : 3949 - 3960
  • [9] A Bayesian Approach for Calibrating Probability Judgments
    Firmino, Paulo Renato A.
    Santana, Nielson A.
    XI BRAZILIAN MEETING ON BAYESIAN STATISTICS (EBEB 2012), 2012, 1490 : 135 - 142
  • [10] Undersampling of approaching the classification boundary for imbalance problem
    Jiang, Lei
    Yuan, Peng
    Liao, Jing
    Zhang, Qiongbing
    Liu, Jianxun
    Li, Keqin
    CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2023, 35 (06) : 1