Enhancing the reliability and accuracy of AI-enabled diagnosis via complementarity-driven deferral to clinicians

Cited by: 57
Authors
Dvijotham, Krishnamurthy [1]
Winkens, Jim [2]
Barsbey, Melih [3]
Ghaisas, Sumedh [4]
Stanforth, Robert [4]
Pawlowski, Nick [5]
Strachan, Patricia [6]
Ahmed, Zahra [4]
Azizi, Shekoofeh [7]
Bachrach, Yoram [4]
Culp, Laura [7]
Daswani, Mayank [6]
Freyberg, Jan [6]
Kelly, Christopher [6]
Kiraly, Atilla [8]
Kohlberger, Timo [8]
McKinney, Scott [9]
Mustafa, Basil [10]
Natarajan, Vivek [8]
Geras, Krzysztof [11]
Witowski, Jan [11]
Qin, Zhi Zhen [12]
Creswell, Jacob [12]
Shetty, Shravya [8]
Sieniek, Marcin [8]
Spitz, Terry [6]
Corrado, Greg [8]
Kohli, Pushmeet [4]
Cemgil, Taylan [4]
Karthikesalingam, Alan [6]
Affiliations
[1] Google DeepMind, Mountain View, CA 94043 USA
[2] Google Res, New York, NY 10065 USA
[3] Bogazici Univ, Istanbul, Turkiye
[4] Google DeepMind, London, England
[5] Microsoft Res, Cambridge, England
[6] Google Res, London, England
[7] Google DeepMind, Toronto, ON, Canada
[8] Google Res, Palo Alto, CA USA
[9] OpenAI, San Francisco, CA USA
[10] Google DeepMind, Zurich, Switzerland
[11] NYU, Grossman Sch Med, New York, NY USA
[12] Stop TB Partnership, Geneva, Switzerland
Keywords
TESTS
DOI
10.1038/s41591-023-02437-x
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010; 081704
Abstract
A collaboration system helps to integrate decisions between human experts and AI to optimize screening and triaging and to reduce clinicians' workload.

Predictive artificial intelligence (AI) systems based on deep learning have been shown to achieve expert-level identification of diseases in multiple medical imaging settings, but can make errors in cases accurately diagnosed by clinicians and vice versa. We developed Complementarity-Driven Deferral to Clinical Workflow (CoDoC), a system that can learn to decide between the opinion of a predictive AI model and a clinical workflow. CoDoC enhances accuracy relative to clinician-only or AI-only baselines in clinical workflows that screen for breast cancer or tuberculosis (TB). For breast cancer screening, compared to double reading with arbitration in a screening program in the UK, CoDoC reduced false positives by 25% at the same false-negative rate, while achieving a 66% reduction in clinician workload. For TB triaging, compared to standalone AI and clinical workflows, CoDoC achieved a 5-15% reduction in false positives at the same false-negative rate for three of five commercially available predictive AI systems. To facilitate the deployment of CoDoC in novel futuristic clinical settings, we present results showing that CoDoC's performance gains are sustained across several axes of variation (imaging modality, clinical setting and predictive AI system) and discuss the limitations of our evaluation and where further validation would be needed. We provide an open-source implementation to encourage further research and application.
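The idea of deciding case by case between an AI model's opinion and the clinical workflow can be illustrated with a minimal sketch. This is not the CoDoC algorithm or its open-source implementation: the function names, the fixed score band, and the thresholds below are all hypothetical, chosen only to show what "deferral" means. CoDoC learns its deferral rule from data, whereas this sketch hard-codes one.

```python
from typing import List, Sequence


def combine_with_deferral(
    ai_scores: Sequence[float],
    clinician_calls: Sequence[bool],
    threshold: float = 0.5,   # hypothetical AI operating point
    defer_low: float = 0.3,   # hypothetical lower edge of the deferral band
    defer_high: float = 0.7,  # hypothetical upper edge of the deferral band
) -> List[bool]:
    """Use the AI call when its score is outside the band, else the clinician call."""
    decisions = []
    for score, clinician in zip(ai_scores, clinician_calls):
        if defer_low <= score <= defer_high:
            # AI is uncertain here: defer to the clinical workflow.
            decisions.append(bool(clinician))
        else:
            # AI is confident: accept its thresholded prediction.
            decisions.append(score >= threshold)
    return decisions


def deferral_rate(
    ai_scores: Sequence[float],
    defer_low: float = 0.3,
    defer_high: float = 0.7,
) -> float:
    """Fraction of cases sent to the clinician under the same band."""
    deferred = sum(1 for s in ai_scores if defer_low <= s <= defer_high)
    return deferred / len(ai_scores)


scores = [0.05, 0.45, 0.9, 0.6]
clinician = [False, True, True, False]
final_calls = combine_with_deferral(scores, clinician)
rate = deferral_rate(scores)
```

In this toy input, two of the four cases fall inside the band and are deferred, so the clinician reads only half of the cases. That is the sense in which a deferral rule can reduce workload while still using clinician judgment where the model is least certain.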
Pages: 1814-1820
Page count: 14
Related papers
39 in total
[1]   Adjusting for multiple testing when reporting research results: The Bonferroni vs Holm methods [J].
Aickin, M ;
Gensler, H .
AMERICAN JOURNAL OF PUBLIC HEALTH, 1996, 86 (05) :726-728
[2]  
[Anonymous], 2019, GUIDANCE SCREENING S
[3]   End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography [J].
Ardila, Diego ;
Kiraly, Atilla P. ;
Bharadwaj, Sujeeth ;
Choi, Bokyung ;
Reicher, Joshua J. ;
Peng, Lily ;
Tse, Daniel ;
Etemadi, Mozziyar ;
Ye, Wenxing ;
Corrado, Greg ;
Naidich, David P. ;
Shetty, Shravya .
NATURE MEDICINE, 2019, 25 (06) :954-+
[4]   Big Self-Supervised Models Advance Medical Image Classification [J].
Azizi, Shekoofeh ;
Mustafa, Basil ;
Ryan, Fiona ;
Beaver, Zachary ;
Freyberg, Jan ;
Deaton, Jonathan ;
Loh, Aaron ;
Karthikesalingam, Alan ;
Kornblith, Simon ;
Chen, Ting ;
Natarajan, Vivek ;
Norouzi, Mohammad .
2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, :3458-3468
[5]  
Charusaie Mohammad-Amin, 2022, P MACHINE LEARNING R
[6]  
D'Amour A, 2022, J MACH LEARN RES, V23
[7]  
European Commission, 2019, USE DOUBL READ MAMM
[8]   Recommended tests and confidence intervals for paired binomial proportions [J].
Fagerland, Morten W. ;
Lydersen, Stian ;
Laake, Petter .
STATISTICS IN MEDICINE, 2014, 33 (16) :2850-2875
[9]  
Fan J., 1994, Journal of computational and graphical statistics, V3, P35
[10]   Use of artificial intelligence for image analysis in breast cancer screening programmes: systematic review of test accuracy [J].
Freeman, Karoline ;
Geppert, Julia ;
Stinton, Chris ;
Todkill, Daniel ;
Johnson, Samantha ;
Clarke, Aileen ;
Taylor-Phillips, Sian .
BMJ-BRITISH MEDICAL JOURNAL, 2021, 374