Challenges in the real world use of classification accuracy metrics: From recall and precision to the Matthews correlation coefficient

Cited by: 47
Authors
Foody, Giles M. [1]
Affiliations
[1] Univ Nottingham, Sch Geog, Nottingham, Notts, England
Keywords
DIAGNOSTIC-TESTS; CONDITIONAL DEPENDENCE; PREVALENCE; DISEASE; AREA; SPECIFICITY; SENSITIVITY; PERFORMANCE; STATISTICS; ISSUES;
DOI
10.1371/journal.pone.0291908
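Chinese Library Classification (CLC)
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Discipline classification codes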
07 ; 0710 ; 09 ;
Abstract
The accuracy of a classification is fundamental to its interpretation, use and ultimately decision making. Unfortunately, the apparent accuracy assessed can differ greatly from the true accuracy. Mis-estimation of classification accuracy metrics and associated mis-interpretations are often due to variations in prevalence and the use of an imperfect reference standard. The fundamental issues underlying the problems associated with variations in prevalence and reference standard quality are revisited here for binary classifications, with particular attention focused on the use of the Matthews correlation coefficient (MCC). A key attribute claimed of the MCC is that a high value can only be attained when the classification performed well on both classes in a binary classification. However, it is shown here that the apparent magnitude of a set of popular accuracy metrics used in fields such as computer science, medicine and environmental science (Recall, Precision, Specificity, Negative Predictive Value, J, F1, likelihood ratios and MCC) and one key attribute (prevalence) were all influenced greatly by variations in prevalence and use of an imperfect reference standard. Simulations using realistic values for data quality in applications such as remote sensing showed each metric varied over the range of possible prevalence and at differing levels of reference standard quality. The direction and magnitude of accuracy metric mis-estimation were a function of prevalence and the size and nature of the imperfections in the reference standard. It was evident that the apparent MCC could be substantially under- or over-estimated. Additionally, a high apparent MCC arose from an unquestionably poor classification. As with some other metrics of accuracy, the utility of the MCC may be overstated and apparent values need to be interpreted with caution. Apparent accuracy and prevalence values can be misleading, and calls for the issues to be recognised and addressed should be heeded.
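The dependence of apparent metrics on prevalence and reference quality described in the abstract can be illustrated with a small calculation. The following is a minimal sketch, not the paper's simulation: it assumes the classifier and the reference standard err independently of each other given the true class, and the function names (`confusion`, `metrics`) and all numeric values are hypothetical choices made for illustration only.

```python
import math

def confusion(prevalence, se_clf, sp_clf, se_ref=1.0, sp_ref=1.0):
    """Expected confusion-matrix proportions (tp, fp, fn, tn) of a classifier
    judged against a reference standard.

    Assumption (not from the source): classifier and reference errors are
    conditionally independent given the true class. With se_ref = sp_ref = 1
    the reference is perfect and the result is the true confusion matrix.
    """
    p, q = prevalence, 1.0 - prevalence
    tp = p * se_clf * se_ref + q * (1 - sp_clf) * (1 - sp_ref)
    fp = p * se_clf * (1 - se_ref) + q * (1 - sp_clf) * sp_ref
    fn = p * (1 - se_clf) * se_ref + q * sp_clf * (1 - sp_ref)
    tn = p * (1 - se_clf) * (1 - se_ref) + q * sp_clf * sp_ref
    return tp, fp, fn, tn

def metrics(tp, fp, fn, tn):
    """Prevalence, recall, precision, specificity and MCC from proportions."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "prevalence": tp + fn,
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "specificity": tn / (tn + fp),
        "mcc": (tp * tn - fp * fn) / denom if denom > 0 else float("nan"),
    }

if __name__ == "__main__":
    # Hypothetical classifier and reference quality, chosen only for illustration.
    for prev in (0.1, 0.3, 0.5):
        true = metrics(*confusion(prev, 0.9, 0.9))
        apparent = metrics(*confusion(prev, 0.9, 0.9, se_ref=0.9, sp_ref=0.9))
        print(f"true prevalence {prev:.1f}: "
              f"apparent prevalence {apparent['prevalence']:.3f}, "
              f"true MCC {true['mcc']:.3f}, apparent MCC {apparent['mcc']:.3f}")
```

Under the independence assumption used here, the apparent MCC is always pulled toward zero relative to the true MCC. The sketch does not model correlated errors between classifier and reference, so it cannot reproduce the over-estimation that the abstract also reports.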
Pages: 27