A discussion of calibration techniques for evaluating binary and categorical predictive models

Cited: 86
Authors
Fenlon, Caroline [1 ]
O'Grady, Luke [2 ]
Doherty, Michael L. [2 ]
Dunnion, John [1 ]
Affiliations
[1] Univ Coll Dublin, Sch Comp Sci, Dublin 4, Ireland
[2] Univ Coll Dublin, Sch Vet Med, Dublin 4, Ireland
Keywords
Predictive modelling; Data mining; Evaluation; Calibration; Deviance; Discrimination; GOODNESS-OF-FIT TESTS; DAIRY-CATTLE; PERFORMANCE; INSEMINATION; CONCEPTION
DOI
10.1016/j.prevetmed.2017.11.018
CLC number
S85 [Animal Medicine (Veterinary Science)]
Discipline code
0906
Abstract
Modelling of binary and categorical events is a commonly used tool to simulate epidemiological processes in veterinary research. Logistic and multinomial regression, naive Bayes, decision trees and support vector machines are popular data mining techniques used to predict the probabilities of events with two or more outcomes. Thorough evaluation of a predictive model is important to validate its suitability for use in decision support or broader simulation modelling. Measures of discrimination, such as sensitivity, specificity and receiver operating characteristics, are commonly used to evaluate how well the model can distinguish between the possible outcomes. However, these discrimination tests cannot confirm that the predicted probabilities are accurate and without bias. This paper describes a range of calibration tests, which typically measure the accuracy of predicted probabilities by comparing them to mean event occurrence rates within groups of similar test records. These include overall goodness-of-fit statistics in the form of the Hosmer-Lemeshow and Brier tests. Visual assessment of prediction accuracy is carried out using plots of calibration and deviance (the difference between the outcome and its predicted probability). The slope and intercept of the calibration plot are compared to the perfect diagonal using the unreliability test. Mean absolute calibration error provides an estimate of the level of predictive error. This paper uses sample predictions from a binary logistic regression model to illustrate the use of calibration techniques. Code is provided to perform the tests in the R statistical programming language. The benefits and disadvantages of each test are described. Discrimination tests are useful for establishing a model's diagnostic abilities, but may not suitably assess the model's usefulness for other predictive applications, such as stochastic simulation.
Calibration tests may be more informative than discrimination tests for evaluating models with a narrow range of predicted probabilities or overall prevalence close to 50%, which are common in epidemiological applications. Using a suite of calibration tests alongside discrimination tests allows model builders to thoroughly measure their model's predictive capabilities.
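The paper supplies R code for these tests, which this record does not reproduce. As a rough illustration only, a minimal Python sketch of three of the calibration measures named in the abstract (the Brier score, the Hosmer-Lemeshow statistic and mean absolute calibration error) might look like the following; the grouping scheme (equal-sized groups by sorted predicted probability) is one common convention, not necessarily the paper's exact choice:

```python
def brier_score(probs, outcomes):
    """Brier score: mean squared difference between predicted
    probabilities and observed binary outcomes (0 = perfect)."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def _groups(probs, outcomes, g):
    """Sort records by predicted probability and split them into g
    roughly equal-sized groups of (probability, outcome) pairs."""
    pairs = sorted(zip(probs, outcomes))
    n = len(pairs)
    return [pairs[i * n // g:(i + 1) * n // g] for i in range(g)]

def hosmer_lemeshow(probs, outcomes, g=10):
    """Hosmer-Lemeshow chi-square statistic: within each group,
    compare the observed event count with the expected count
    implied by the predicted probabilities."""
    stat = 0.0
    for chunk in _groups(probs, outcomes, g):
        if not chunk:
            continue
        n_k = len(chunk)
        observed = sum(y for _, y in chunk)
        expected = sum(p for p, _ in chunk)
        p_bar = expected / n_k
        denom = n_k * p_bar * (1 - p_bar)
        if denom > 0:
            stat += (observed - expected) ** 2 / denom
    return stat

def mean_abs_calibration_error(probs, outcomes, g=10):
    """Average absolute gap between each group's observed event
    rate and its mean predicted probability."""
    errors = []
    for chunk in _groups(probs, outcomes, g):
        if chunk:
            n_k = len(chunk)
            obs_rate = sum(y for _, y in chunk) / n_k
            mean_pred = sum(p for p, _ in chunk) / n_k
            errors.append(abs(obs_rate - mean_pred))
    return sum(errors) / len(errors)
```

For well-calibrated predictions all three quantities are small: the Hosmer-Lemeshow statistic and the mean absolute calibration error approach zero, while the Brier score approaches the irreducible outcome variance.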
Pages: 107-114 (8 pages)