Empirical analysis of performance assessment for imbalanced classification

Cited by: 18
Authors
Gaudreault, Jean-Gabriel [1]
Branco, Paula [1]
Affiliations
[1] Univ Ottawa, Sch Elect Engn & Comp Sci, Ottawa, ON, Canada
Funding
Natural Sciences and Engineering Research Council of Canada (NSERC)
Keywords
Imbalanced learning; Performance metrics; Performance evaluation; ROC; AREA
DOI
10.1007/s10994-023-06497-5
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
There are many machine learning scenarios in which the data is heavily biased towards one of the classes. Evaluating the performance of machine learning models in such imbalanced scenarios is challenging, since one of the classes is poorly represented in the data and is often the one most relevant to the end-user. An abundance of performance metrics has been devised to address these problems; however, there is often no common agreement on which metric is best or which to use in a specific imbalanced scenario. In this study, we experimentally investigate the impact of choosing one metric over another when evaluating a binary classifier, as well as the effect of data characteristics such as class imbalance and noise on those metrics. Based on our extensive empirical analysis, we provide a set of easy-to-follow guidelines on which performance metric to use depending on the context of the problem. Specifically, we highlight the importance of combining metrics that are fundamentally different in imbalanced domains, present results showing why Davis' interpolation of the area under the precision-recall curve and the Matthews Correlation Coefficient should be preferred over similar metrics, and explain why the geometric mean and the F1 score should be avoided in scenarios likely to exhibit label noise.
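As a concrete illustration of the metrics discussed in the abstract, the following minimal Python sketch (our own illustration, not the authors' code) computes MCC, the F1 score, the geometric mean, and an area-under-the-PR-curve estimate on a synthetic imbalanced problem with scikit-learn. Note that scikit-learn's average_precision_score is a step-wise AUC-PR estimate, not Davis' interpolation, which the paper recommends; it is used here only as a stand-in.

    # Minimal sketch (ours, not the authors' code): metric computation on a
    # synthetic 99:1 imbalanced binary problem using scikit-learn.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import (average_precision_score, f1_score,
                                 matthews_corrcoef, recall_score)
    from sklearn.model_selection import train_test_split

    # Synthetic dataset: 99% majority class, 1% label noise (flip_y).
    X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0.01,
                               random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    y_pred = clf.predict(X_te)
    scores = clf.predict_proba(X_te)[:, 1]

    mcc = matthews_corrcoef(y_te, y_pred)  # Matthews Correlation Coefficient
    f1 = f1_score(y_te, y_pred)            # F1 score
    # Geometric mean of sensitivity (recall on class 1) and specificity
    # (recall on class 0).
    gmean = np.sqrt(recall_score(y_te, y_pred) *
                    recall_score(y_te, y_pred, pos_label=0))
    # Step-wise AUC-PR estimate; NOT Davis' interpolation.
    ap = average_precision_score(y_te, scores)

    print(f"MCC={mcc:.3f}  F1={f1:.3f}  G-mean={gmean:.3f}  AP={ap:.3f}")

On data this skewed, a trivial majority-class predictor already reaches roughly 99% accuracy while its MCC is 0, which is why the abstract stresses metrics that remain informative under imbalance.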
Pages: 5533-5575
Page count: 43
Related Papers
36 records in total
[1]   A survey on learning from imbalanced data streams: taxonomy, challenges, empirical study, and reproducible experimental framework [J].
Aguiar, Gabriel ;
Krawczyk, Bartosz ;
Cano, Alberto .
MACHINE LEARNING, 2024, 113 (07) :4165-4243
[2]  
Alcalá-Fdez J, 2011, J MULT-VALUED LOG S, V17, P255
[3]   A Survey of Predictive Modeling on Imbalanced Domains [J].
Branco, Paula ;
Torgo, Luis ;
Ribeiro, Rita P. .
ACM COMPUTING SURVEYS, 2016, 49 (02)
[4]   The Matthews Correlation Coefficient (MCC) is More Informative Than Cohen's Kappa and Brier Score in Binary Classification Assessment [J].
Chicco, Davide ;
Warrens, Matthijs J. ;
Jurman, Giuseppe .
IEEE ACCESS, 2021, 9 :78368-78381
[5]   Learning from imbalanced data in surveillance of nosocomial infection [J].
Cohen, Gilles ;
Hilario, Melanie ;
Sax, Hugo ;
Hugonnet, Stephane ;
Geissbuhler, Antoine .
ARTIFICIAL INTELLIGENCE IN MEDICINE, 2006, 37 (01) :7-18
[6]   A COEFFICIENT OF AGREEMENT FOR NOMINAL SCALES [J].
COHEN, J .
EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT, 1960, 20 (01) :37-46
[7]  
Davis J, 2006, PROCEEDINGS OF THE 23RD INTERNATIONAL CONFERENCE ON MACHINE LEARNING, P233
[8]   Why Cohen's Kappa should be avoided as performance measure in classification [J].
Delgado, Rosario ;
Tibau, Xavier-Andoni .
PLOS ONE, 2019, 14 (09)
[9]  
Egan J.P., 1975, Signal Detection Theory and ROC Analysis, Series in Cognition and Perception
[10]   Novelty detection in data streams [J].
Faria, Elaine R. ;
Goncalves, Isabel J. C. R. ;
de Carvalho, Andre C. P. L. F. ;
Gama, Joao .
ARTIFICIAL INTELLIGENCE REVIEW, 2016, 45 (02) :235-269