Calibrating Probability with Undersampling for Unbalanced Classification

Cited by: 275
Authors
Dal Pozzolo, Andrea [1]
Caelen, Olivier [2]
Johnson, Reid A. [3]
Bontempi, Gianluca [1, 4]
Affiliations
[1] Univ Libre Bruxelles, Machine Learning Grp, Dept Comp Sci, Brussels, Belgium
[2] Worldline SA, Fraud Risk Management Analyt, Brussels, Belgium
[3] Univ Notre Dame, iCeNSA, Dept Comp Sci & Engn, Notre Dame, IN 46556 USA
[4] Interuniv Inst Bioinformat Brussels IB2, Brussels, Belgium
Source
2015 IEEE SYMPOSIUM SERIES ON COMPUTATIONAL INTELLIGENCE (IEEE SSCI) | 2015
DOI
10.1109/SSCI.2015.33
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Undersampling is a popular technique for unbalanced datasets to reduce the skew in class distributions. However, it is well-known that undersampling one class modifies the priors of the training set and consequently biases the posterior probabilities of a classifier [9]. In this paper, we study analytically and experimentally how undersampling affects the posterior probability of a machine learning model. We formalize the problem of undersampling and explore the relationship between conditional probability in the presence and absence of undersampling. Although the bias due to undersampling does not affect the ranking order returned by the posterior probability, it significantly impacts the classification accuracy and probability calibration. We use Bayes Minimum Risk theory to find the correct classification threshold and show how to adjust it after undersampling. Experiments on several real-world unbalanced datasets validate our results.
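The relationship between the posterior with and without undersampling can be illustrated with a minimal sketch. It assumes random undersampling in which every positive (minority) example is kept and each negative example is retained independently with probability `beta`; the names `beta`, `correct_posterior`, and `undersampled_threshold` are illustrative and do not come from this record. Under that assumption, Bayes' rule gives p_s = p / (p + beta * (1 - p)), which is monotone in p (so ranking is preserved) but inflated relative to p (so calibration and the decision threshold shift).

```python
def undersampled_posterior(p: float, beta: float) -> float:
    """Posterior of the positive class after undersampling negatives.

    Assumes each negative is kept with probability beta (0 < beta <= 1)
    and all positives are kept.
    """
    return p / (p + beta * (1.0 - p))


def correct_posterior(p_s: float, beta: float) -> float:
    """Invert the bias: map a posterior learned on undersampled data
    back to the original class distribution."""
    return beta * p_s / (beta * p_s - p_s + 1.0)


def undersampled_threshold(tau: float, beta: float) -> float:
    """Threshold to apply to the biased posterior p_s so that the decision
    matches thresholding the unbiased posterior p at tau."""
    return tau / (tau + beta * (1.0 - tau))


if __name__ == "__main__":
    beta = 0.2   # hypothetical retention rate for the negative class
    p = 0.1      # hypothetical true posterior
    p_s = undersampled_posterior(p, beta)
    print(p_s, correct_posterior(p_s, beta), undersampled_threshold(0.5, beta))
```

Either correction gives the same decisions: one can rescale the biased scores with `correct_posterior` and keep the original threshold, or keep the biased scores and move the threshold with `undersampled_threshold`.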
Pages: 159-166
Number of pages: 8
Related papers
31 in total
[1]   Applying support vector machines to imbalanced datasets [J].
Akbani, R ;
Kwek, S ;
Japkowicz, N .
MACHINE LEARNING: ECML 2004, PROCEEDINGS, 2004, 3201 :39-50
[2]  
[Anonymous], 2003, Statistical pattern recognition
[3]   Evolving Diverse Ensembles Using Genetic Programming for Classification With Unbalanced Data [J].
Bhowan, Urvesh ;
Johnston, Mark ;
Zhang, Mengjie ;
Yao, Xin .
IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION, 2013, 17 (03) :368-386
[4]  
Bishop C.M., 2006, Pattern recognition and machine learning, DOI 10.1007/978-0-387-45528-0
[5]  
Boracchi G., 2015, NEUR NETW IJCNN 2015
[6]   Random forests [J].
Breiman, L .
MACHINE LEARNING, 2001, 45 (01) :5-32
[7]  
Brier G. W., 1950, Monthly weather review, V78, P1, DOI 10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2