Calibrating Probability with Undersampling for Unbalanced Classification

Cited by: 252
Authors
Dal Pozzolo, Andrea [1]
Caelen, Olivier [2]
Johnson, Reid A. [3]
Bontempi, Gianluca [1, 4]
Affiliations
[1] Univ Libre Bruxelles, Machine Learning Grp, Dept Comp Sci, Brussels, Belgium
[2] Worldline SA, Fraud Risk Management Analyt, Brussels, Belgium
[3] Univ Notre Dame, iCeNSA, Dept Comp Sci & Engn, Notre Dame, IN 46556 USA
[4] Interuniv Inst Bioinformat Brussels IB2, Brussels, Belgium
DOI
10.1109/SSCI.2015.33
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Undersampling is a popular technique for reducing the skew in class distributions of unbalanced datasets. However, it is well known that undersampling one class modifies the priors of the training set and consequently biases the posterior probabilities of a classifier [9]. In this paper, we study analytically and experimentally how undersampling affects the posterior probability of a machine learning model. We formalize the problem of undersampling and explore the relationship between conditional probability in the presence and absence of undersampling. Although the bias due to undersampling does not affect the ranking order returned by the posterior probability, it significantly impacts the classification accuracy and probability calibration. We use Bayes Minimum Risk theory to find the correct classification threshold and show how to adjust it after undersampling. Experiments on several real-world unbalanced datasets validate our results.
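The relationship described in the abstract can be illustrated with a minimal sketch. It is not taken from the record itself: it assumes that all positive (minority) instances are kept and that each negative (majority) instance is retained independently with probability beta, in which case Bayes' rule gives p_s = p / (p + beta * (1 - p)) for the posterior after undersampling. The function names and the parameter beta are hypothetical, chosen only for illustration.

```python
# Sketch under stated assumptions: positives are always kept, negatives are
# kept independently with probability beta (0 < beta <= 1).
# p   = posterior P(y=1 | x) on the original distribution
# p_s = posterior P(y=1 | x) learned on the undersampled data

def undersampled_posterior(p: float, beta: float) -> float:
    """Posterior after undersampling, obtained from Bayes' rule."""
    return p / (p + beta * (1.0 - p))


def calibrate_posterior(p_s: float, beta: float) -> float:
    """Invert the mapping above to recover the original-scale posterior."""
    return beta * p_s / (beta * p_s - p_s + 1.0)


def threshold_after_undersampling(tau: float, beta: float) -> float:
    """Threshold on p_s that is equivalent to threshold tau on p.

    The mapping p -> p_s is strictly increasing, so it can be applied to
    the threshold directly.
    """
    return undersampled_posterior(tau, beta)


if __name__ == "__main__":
    beta = 0.1                      # keep 10% of the negatives (illustrative)
    for p in (0.01, 0.1, 0.5):
        p_s = undersampled_posterior(p, beta)
        print(p, round(p_s, 4), round(calibrate_posterior(p_s, beta), 4))
```

Because the mapping is strictly increasing, the ranking of instances is unchanged, while the probabilities themselves are inflated whenever beta < 1; this is consistent with the abstract's remark that the bias affects calibration and the classification threshold but not the ranking order.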
Pages: 159 - 166
Page count: 8
Related papers
50 in total
  • [31] GUM: A Guided Undersampling Method to Preprocess Imbalanced Datasets for Classification
    Sung, Kisuk
    Brown, W. Eric
    Moreno-Centeno, Erick
    Ding, Yu
    2022 IEEE 18TH INTERNATIONAL CONFERENCE ON AUTOMATION SCIENCE AND ENGINEERING (CASE), 2022, : 1086 - 1091
  • [32] An approach for classification of highly imbalanced data using weighting and undersampling
    Anand, Ashish
    Pugalenthi, Ganesan
    Fogel, Gary B.
    Suganthan, P. N.
    AMINO ACIDS, 2010, 39 (05) : 1385 - 1391
  • [33] Anomaly detection-based undersampling for imbalanced classification problems
    Park, You-Jin
    Brito, Paula
    Ma, Yun-Chen
    ENGINEERING OPTIMIZATION, 2024, 56 (12) : 2565 - 2578
  • [34] Evolutionary undersampling boosting for imbalanced classification of breast cancer malignancy
    Krawczyk, Bartosz
    Galar, Mikel
    Jelen, Lukasz
    Herrera, Francisco
    APPLIED SOFT COMPUTING, 2016, 38 : 714 - 726
  • [35] Diversified Sensitivity-Based Undersampling for Imbalance Classification Problems
    Ng, Wing W. Y.
    Hu, Junjie
    Yeung, Daniel S.
    Yin, Shaohua
    Roli, Fabio
    IEEE TRANSACTIONS ON CYBERNETICS, 2015, 45 (11) : 2402 - 2412
  • [36] CALIBRATING PROBABILITIES FOR HYPERSPECTRAL CLASSIFICATION OF ROCK TYPES
    Monteiro, Sildomar T.
    Murphy, Richard J.
    2010 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM, 2010, : 2800 - 2803
  • [37] Calibrating machine learning approaches for probability estimation: A short expansion
    Ojeda, Francisco M.
    Baker, Stuart G.
    Ziegler, Andreas
    STATISTICS IN MEDICINE, 2024, 43 (21) : 4212 - 4215
  • [38] Calibrating Probability Density Forecasts with Multi-objective Search
    Carney, Michael
    Cunningham, Padraig
    ECAI 2006, PROCEEDINGS, 2006, 141 : 791 - +
  • [39] Extended Probability Perturbation Method for Calibrating Stochastic Reservoir Models
    Hu, Lin Y.
    MATHEMATICAL GEOSCIENCES, 2008, 40 (08) : 875 - 885
  • [40] Hybrid algorithm for classification of unbalanced datasets
    Han, Min
    Zhu, Xin-Rong
    Kongzhi Lilun Yu Yingyong/Control Theory and Applications, 2011, 28 (10) : 1485 - 1489