Deep ROC Analysis and AUC as Balanced Average Accuracy, for Improved Classifier Selection, Audit and Explanation

Cited by: 103
Authors
Carrington, Andre M. [1 ]
Manuel, Douglas G. [2 ,3 ]
Fieguth, Paul W. [4 ,5 ]
Ramsay, Tim [6 ]
Osmani, Venet [7 ]
Wernly, Bernhard [8 ]
Bennett, Carol [9 ]
Hawken, Steven [6 ]
Magwood, Olivia [10 ]
Sheikh, Yusuf
McInnes, Matthew [6 ]
Holzinger, Andreas [11 ,12 ]
Affiliations
[1] Univ Waterloo, Ottawa Hosp & Reg Imaging Associates, Dept Syst Design Engn, Waterloo, ON N2L 3G1, Canada
[2] Ottawa Hosp Res Inst, Inst Clin Evaluat Sci, Ottawa, ON K1N 6N5, Canada
[3] Bruyere Res Inst, Ottawa, ON K1R 6M1, Canada
[4] Univ Waterloo, Dept Syst Design Engn, Waterloo, ON N2L 3G1, Canada
[5] Univ Waterloo, Fac Engn, Waterloo, ON N2L 3G1, Canada
[6] Univ Ottawa, Ottawa Hosp Res Inst, Ottawa, ON K1N 6N5, Canada
[7] Univ Trento, Fdn Bruno Kessler Res Inst, Dept Psychol & Cognit Sci, I-38122 Trento, TN, Italy
[8] Paracelsus Med Univ Salzburg, Dept Cardiol, A-5020 Salzburg, Austria
[9] Ottawa Hosp Res Inst, Inst Clin Evaluat Sci, Ottawa, ON K1N 6N5, Canada
[10] Univ Ottawa, Bruyere Res Inst, Ottawa, ON K1N 6N5, Canada
[11] Univ Alberta, Alberta Machine Intelligence Inst, Edmonton, AB T6G 2R3, Canada
[12] Med Univ Graz, Human Ctr Lab, A-8036 Graz, Austria
Funding
Austrian Science Fund;
Keywords
Performance and reliability; performance analysis and design aids; diagnostic testing; artificial intelligence; ROC; AUC; C statistic; explainable AI; equity; audit; PREDICTION MODELS; PARTIAL AREA; CURVE; PERFORMANCE; TESTS;
DOI
10.1109/TPAMI.2022.3145392
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Optimal performance is desired for decision-making in any field with binary classifiers and diagnostic tests; however, common performance measures lack depth of information. The area under the receiver operating characteristic curve (AUC) and the area under the precision-recall curve are too general because they evaluate all decision thresholds, including unrealistic ones. Conversely, accuracy, sensitivity, specificity, positive predictive value and the F1 score are too specific: they are measured at a single threshold that is optimal for some instances but not others, which is not equitable. In between both approaches, we propose deep ROC analysis to measure performance in multiple groups of predicted risk (like calibration), or groups of true positive rate or false positive rate. In each group, we measure the group AUC (properly), normalized group AUC, and averages of: sensitivity, specificity, positive and negative predictive value, and likelihood ratio positive and negative. The measurements can be compared between groups, to whole measures, to point measures and between models. We also provide a new interpretation of AUC, in whole or part, as balanced average accuracy, relevant to individuals instead of pairs. We evaluate models in three case studies using our method and Python toolkit and confirm its utility.
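To make the group-wise idea in the abstract concrete, the following is a minimal NumPy sketch of a normalized group AUC over false positive rate groups. It is not the authors' toolkit: the function names, the equal-width FPR grouping, and the per-instance tie handling are assumptions of this sketch; a proper implementation would group tied scores and support predicted-risk and TPR groups as well.

```python
import numpy as np

def roc_points(scores, labels):
    """ROC curve points (FPR, TPR) from scores and 0/1 labels.

    Ties are handled per-instance in this sketch; a careful
    implementation would step through tied scores jointly.
    """
    order = np.argsort(-np.asarray(scores))   # descending score order
    labels = np.asarray(labels)[order]
    P = labels.sum()
    N = len(labels) - P
    tpr = np.concatenate(([0.0], np.cumsum(labels) / P))
    fpr = np.concatenate(([0.0], np.cumsum(1 - labels) / N))
    return fpr, tpr

def group_auc(fpr, tpr, lo, hi):
    """Partial AUC over the FPR group [lo, hi], normalized by group
    width so a perfect classifier scores 1.0 within the group."""
    area = 0.0
    for i in range(len(fpr) - 1):
        x0, x1, y0, y1 = fpr[i], fpr[i + 1], tpr[i], tpr[i + 1]
        if x1 <= lo or x0 >= hi:
            continue                           # segment outside the group
        if x0 < lo:                            # clip left endpoint to lo
            y0 += (y1 - y0) * (lo - x0) / (x1 - x0)
            x0 = lo
        if x1 > hi:                            # clip right endpoint to hi
            y1 = y0 + (y1 - y0) * (hi - x0) / (x1 - x0)
            x1 = hi
        area += (x1 - x0) * (y0 + y1) / 2.0    # trapezoid rule
    return area / (hi - lo)
```

For example, with equal-width groups the width-weighted average of the normalized group AUCs recovers the whole AUC, while the per-group values reveal where on the ROC curve a model earns its performance.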
Pages: 329-341 (13 pages)