Validation and algorithmic audit of a deep learning system for the detection of proximal femoral fractures in patients in the emergency department: a diagnostic accuracy study

Cited by: 48
Authors
Oakden-Rayner, Lauren [1 ,2 ]
Gale, William [2 ,3 ]
Bonham, Thomas A. [4 ]
Lungren, Matthew P. [4 ,5 ]
Carneiro, Gustavo [2 ]
Bradley, Andrew P. [6 ]
Palmer, Lyle J. [1 ,2 ]
Affiliations
[1] Univ Adelaide, Sch Publ Hlth, Adelaide, SA, Australia
[2] Univ Adelaide, Australian Inst Machine Learning, Adelaide, SA 5000, Australia
[3] Univ Adelaide, Sch Comp Sci, Adelaide, SA, Australia
[4] Stanford Univ, Dept Radiol, Sch Med, Stanford, CA 94305 USA
[5] Stanford Univ, Stanford Artificial Intelligence Med & Imaging Ct, Stanford, CA 94305 USA
[6] Queensland Univ Technol, Sci & Engn Fac, Brisbane, Qld, Australia
Source
LANCET DIGITAL HEALTH | 2022, Vol. 4, Issue 5
Funding
Australian Research Council
Keywords
HIP-FRACTURES; RADIOGRAPHS; MORTALITY; SURGERY; DELAY;
DOI
10.1016/S2589-7500(22)00004-8
Chinese Library Classification (CLC): R-058
Abstract
Background: Proximal femoral fractures are an important clinical and public health issue associated with substantial morbidity and early mortality. Artificial intelligence might offer improved diagnostic accuracy for these fractures, but typical approaches to testing of artificial intelligence models can underestimate the risks of artificial intelligence-based diagnostic systems.
Methods: We present a preclinical evaluation of a deep learning model intended to detect proximal femoral fractures on frontal x-rays of emergency department patients, trained on films from the Royal Adelaide Hospital (Adelaide, SA, Australia). This evaluation included a reader study comparing the performance of the model against five radiologists (three musculoskeletal specialists and two general radiologists) on a dataset of 200 fracture cases and 200 non-fractures (also from the Royal Adelaide Hospital), an external validation study using a dataset obtained from Stanford University Medical Center, CA, USA, and an algorithmic audit to detect any unusual or unexpected model behaviour.
Findings: In the reader study, the area under the receiver operating characteristic curve (AUC) for the deep learning model was 0.994 (95% CI 0.988-0.999), compared with an AUC of 0.969 (0.960-0.978) for the five radiologists. This strong model performance was maintained on external validation, with an AUC of 0.980 (0.931-1.000). However, the preclinical evaluation identified barriers to safe deployment, including a substantial shift in the model operating point on external validation and an increased error rate on cases with abnormal bones (eg, Paget's disease).
Interpretation: The model outperformed the radiologists tested and maintained performance on external validation, but showed several unexpected limitations during further testing. Thorough preclinical evaluation of artificial intelligence models, including algorithmic auditing, can reveal unexpected and potentially harmful behaviour even in high-performance artificial intelligence systems, which can inform future clinical testing and deployment decisions.
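The findings above are reported as AUCs with 95% confidence intervals and an operating point that shifted on external validation. The sketch below illustrates, in general terms, how such an AUC with a percentile bootstrap interval and a sensitivity/specificity operating point might be computed for a binary fracture detector; it is not the authors' evaluation code, and the function names, threshold, and simulated data are assumptions made for illustration only.

```python
# Illustrative sketch only (not the study's code): AUC with a percentile
# bootstrap 95% CI, plus sensitivity/specificity at a fixed score threshold.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def auc_with_bootstrap_ci(y_true, y_score, n_boot=2000, alpha=0.05):
    """Point-estimate AUC plus a percentile bootstrap confidence interval."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    point = roc_auc_score(y_true, y_score)
    aucs, n = [], len(y_true)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)          # resample cases with replacement
        if len(np.unique(y_true[idx])) < 2:  # skip resamples with one class only
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return point, (lo, hi)

def operating_point(y_true, y_score, threshold):
    """Sensitivity and specificity when scores >= threshold are called positive."""
    y_true = np.asarray(y_true)
    pred = np.asarray(y_score) >= threshold
    sensitivity = np.mean(pred[y_true == 1])
    specificity = np.mean(~pred[y_true == 0])
    return sensitivity, specificity

# Toy usage with simulated labels and scores (hypothetical data).
y = rng.integers(0, 2, 400)
s = np.clip(y * 0.6 + rng.normal(0.2, 0.25, 400), 0, 1)
print(auc_with_bootstrap_ci(y, s))
print(operating_point(y, s, threshold=0.5))
```

A shift in the operating point on external data, as reported in the Findings, would appear in this kind of analysis as a change in the sensitivity/specificity pair obtained at the same fixed threshold on the new dataset.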
Pages: E351-E358 (8 pages)