Artificial intelligence versus clinicians: systematic review of design, reporting standards, and claims of deep learning studies

Cited by: 555
Authors
Nagendran, Myura [1]
Chen, Yang [2]
Lovejoy, Christopher A. [3]
Gordon, Anthony C. [1, 4]
Komorowski, Matthieu [5]
Harvey, Hugh [6]
Topol, Eric J. [7]
Ioannidis, John P. A. [8, 9, 10, 11, 12]
Collins, Gary S. [13, 14]
Maruthappu, Mahiben [3]
Affiliations
[1] Imperial Coll London, Dept Surg & Canc, Div Anaesthet Pain Med & Intens Care, London, England
[2] UCL, Inst Cardiovasc Sci, London, England
[3] Cera Care, London, England
[4] Imperial Coll Healthcare NHS Trust, Ctr Perioperat & Crit Care Res, London, England
[5] Imperial Coll London, Dept Bioengn, London, England
[6] Hardian Hlth, London, England
[7] Scripps Res Translat Inst, La Jolla, CA USA
[8] Stanford Univ, Dept Med, Stanford, CA 94305 USA
[9] Stanford Univ, Dept Hlth Res & Policy, Stanford, CA 94305 USA
[10] Stanford Univ, Dept Biomed Data Sci, Stanford, CA 94305 USA
[11] Stanford Univ, Dept Stat, Stanford, CA 94305 USA
[12] Stanford Univ, Meta Res Innovat Ctr Stanford METRICS, Stanford, CA 94305 USA
[13] Univ Oxford, Ctr Stat Med, Oxford, England
[14] Oxford Univ Hosp NHS Trust, NIHR Oxford Biomed Res Ctr, Oxford, England
Source
BMJ-BRITISH MEDICAL JOURNAL | 2020 / Vol. 368
Keywords
PREDICTION MODEL; REDUCING WASTE; APPLICABILITY; EXPLANATION; PROBAST; RISK; BIAS; TOOL; FDA;
DOI
10.1136/bmj.m689
Chinese Library Classification (CLC) number
R5 [Internal Medicine]
Subject classification code
1002; 100201
Abstract
OBJECTIVE To systematically examine the design, reporting standards, risk of bias, and claims of studies comparing the performance of diagnostic deep learning algorithms for medical imaging with that of expert clinicians.
DESIGN Systematic review.
DATA SOURCES Medline, Embase, Cochrane Central Register of Controlled Trials, and the World Health Organization trial registry from 2010 to June 2019.
ELIGIBILITY CRITERIA FOR SELECTING STUDIES Randomised trial registrations and non-randomised studies comparing the performance of a deep learning algorithm in medical imaging with a contemporary group of one or more expert clinicians. Medical imaging has seen a growing interest in deep learning research. The main distinguishing feature of convolutional neural networks (CNNs) in deep learning is that when CNNs are fed with raw data, they develop their own representations needed for pattern recognition. The algorithm learns for itself the features of an image that are important for classification rather than being told by humans which features to use. The selected studies aimed to use medical imaging for predicting absolute risk of existing disease or classification into diagnostic groups (eg, disease or non-disease). For example, raw chest radiographs are tagged with a label such as pneumothorax or no pneumothorax, and the CNN learns which pixel patterns suggest pneumothorax.
REVIEW METHODS Adherence to reporting standards was assessed by using CONSORT (consolidated standards of reporting trials) for randomised studies and TRIPOD (transparent reporting of a multivariable prediction model for individual prognosis or diagnosis) for non-randomised studies. Risk of bias was assessed by using the Cochrane risk of bias tool for randomised studies and PROBAST (prediction model risk of bias assessment tool) for non-randomised studies.
RESULTS Only 10 records were found for deep learning randomised clinical trials, two of which have been published (with low risk of bias, except for lack of blinding, and high adherence to reporting standards) and eight are ongoing. Of 81 non-randomised clinical trials identified, only nine were prospective and just six were tested in a real world clinical setting. The median number of experts in the comparator group was only four (interquartile range 2-9). Full access to all datasets and code was severely limited (unavailable in 95% and 93% of studies, respectively). The overall risk of bias was high in 58 of 81 studies and adherence to reporting standards was suboptimal (<50% adherence for 12 of 29 TRIPOD items). 61 of 81 studies stated in their abstract that performance of artificial intelligence was at least comparable to (or better than) that of clinicians. Only 31 of 81 studies (38%) stated that further prospective studies or trials were required.
CONCLUSIONS Few prospective deep learning studies and randomised trials exist in medical imaging. Most non-randomised trials are not prospective, are at high risk of bias, and deviate from existing reporting standards. Data and code availability are lacking in most studies, and human comparator groups are often small. Future studies should diminish risk of bias, enhance real world clinical relevance, improve reporting and transparency, and appropriately temper conclusions.
STUDY REGISTRATION PROSPERO CRD42019123605.
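The counts in the RESULTS paragraph can be turned into proportions with simple arithmetic. The following is a minimal sketch, not part of the original record: the abstract itself states only the 38% figure, so the other percentages are derived here from the quoted counts, and the dictionary labels and the `percent` helper are illustrative names.

```python
from fractions import Fraction

TOTAL_STUDIES = 81  # non-randomised studies, as quoted in the abstract

# Counts quoted in the RESULTS paragraph (each out of 81 studies).
# The labels are paraphrases chosen for this sketch.
counts = {
    "prospective": 9,
    "tested in a real world clinical setting": 6,
    "high overall risk of bias": 58,
    "abstract claims AI comparable to or better than clinicians": 61,
    "state that further prospective studies or trials are required": 31,
}


def percent(part: int, whole: int) -> int:
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(Fraction(part * 100, whole))


shares = {label: percent(n, TOTAL_STUDIES) for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label}: {counts[label]}/{TOTAL_STUDIES} = {share}%")
```

Using exact fractions avoids floating point rounding surprises; the last entry reproduces the 38% reported in the abstract.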
Pages: 12