APPRAISE-AI Tool for Quantitative Evaluation of AI Studies for Clinical Decision Support

Cited: 38
Authors
Kwong, Jethro C. C. [2 ,3 ]
Khondker, Adree [2 ]
Lajkosz, Katherine [2 ,4 ]
McDermott, Matthew B. A. [5 ]
Frigola, Xavier Borrat [6 ,7 ]
McCradden, Melissa D. [8 ,9 ,10 ]
Mamdani, Muhammad [3 ,11 ]
Kulkarni, Girish S. [2 ,12 ]
Johnson, Alistair E. W. [1 ,3 ,13 ,14 ]
Affiliations
[1] Hosp Sick Children, Child Hlth Evaluat Sci, 686 Bay St, Toronto, ON M5G 0A4, Canada
[2] Univ Toronto, Dept Surg, Div Urol, Toronto, ON, Canada
[3] Univ Toronto, Temerty Ctr AI Res & Educ Med, Toronto, ON, Canada
[4] Univ Toronto, Univ Hlth Network, Dept Biostat, Toronto, ON, Canada
[5] MIT, Dept Biomed Informat, Cambridge, MA USA
[6] Harvard Massachusetts Inst Technol, Div Hlth Sci & Technol, Lab Computat Physiol, Cambridge, MA USA
[7] Hosp Clin Barcelona, Dept Anesthesiol & Crit Care, Barcelona, Spain
[8] Hosp Sick Children, Dept Bioeth, Toronto, ON, Canada
[9] Genet & Genome Biol Res Program, Peter Gilgan Ctr Res & Learning, Toronto, ON, Canada
[10] Univ Toronto, Dalla Lana Sch Publ Hlth, Div Clin & Publ Hlth, Toronto, ON, Canada
[11] Unity Hlth Toronto, Data Sci & Adv Analyt, Toronto, ON, Canada
[12] Univ Toronto, Univ Hlth Network, Princess Margaret Canc Ctr, Toronto, ON, Canada
[13] Univ Toronto, Dalla Lana Sch Publ Hlth, Div Biostat, Toronto, ON, Canada
[14] Univ Toronto, Hosp Sick Children, Child Hlth Evaluat Sci, Toronto, ON, Canada
Keywords
ARTIFICIAL-INTELLIGENCE; SEPTIC SHOCK; PREDICTION; GUIDELINES; SEPSIS;
DOI
10.1001/jamanetworkopen.2023.35377
Chinese Library Classification
R5 [Internal Medicine];
Discipline Code
1002; 100201;
Abstract
Importance: Artificial intelligence (AI) has gained considerable attention in health care, yet concerns have been raised about appropriate methods and fairness. Current AI reporting guidelines do not provide a means of quantifying the overall quality of AI research, limiting their ability to compare models addressing the same clinical question.

Objective: To develop a tool (APPRAISE-AI) to evaluate the methodological and reporting quality of AI prediction models for clinical decision support.

Design, Setting, and Participants: This quality improvement study evaluated AI studies in the model development, silent, and clinical trial phases using the APPRAISE-AI tool, a quantitative method for evaluating the quality of AI studies across 6 domains: clinical relevance, data quality, methodological conduct, robustness of results, reporting quality, and reproducibility. These domains comprised 24 items with a maximum overall score of 100 points. Points were assigned to each item, with higher points indicating stronger methodological or reporting quality. The tool was applied to a systematic review on machine learning to predict sepsis that included articles published until September 13, 2019. Data analysis was performed from September to December 2022.

Main Outcomes and Measures: The primary outcomes were interrater and intrarater reliability and the correlation between APPRAISE-AI scores and expert scores, 3-year citation rate, number of Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) low risk-of-bias domains, and overall adherence to the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) statement.

Results: A total of 28 studies were included. Overall APPRAISE-AI scores ranged from 33 (low quality) to 67 (high quality). Most studies were of moderate quality. The 5 lowest scoring items were source of data, sample size calculation, bias assessment, error analysis, and transparency. Overall APPRAISE-AI scores were associated with expert scores (Spearman rho, 0.82; 95% CI, 0.64-0.91; P < .001), 3-year citation rate (Spearman rho, 0.69; 95% CI, 0.43-0.85; P < .001), number of QUADAS-2 low risk-of-bias domains (Spearman rho, 0.56; 95% CI, 0.24-0.77; P = .002), and adherence to the TRIPOD statement (Spearman rho, 0.87; 95% CI, 0.73-0.94; P < .001). Intraclass correlation coefficient ranges for interrater and intrarater reliability were 0.74 to 1.00 for individual items, 0.81 to 0.99 for individual domains, and 0.91 to 0.98 for overall scores.

Conclusions and Relevance: In this quality improvement study, APPRAISE-AI demonstrated strong interrater and intrarater reliability and correlated well with several study quality measures. This tool may provide a quantitative approach for investigators, reviewers, editors, and funding organizations to compare research quality across AI studies for clinical decision support.
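As a rough illustration of the kind of scoring and validation analysis the abstract describes, the sketch below aggregates per-item points into an overall score and computes a Spearman rank correlation against expert scores. The six domain names come from the abstract; the item point values, quality-band cut-offs, and example data are assumptions of this sketch, not the published APPRAISE-AI rubric.

```python
from statistics import mean

# Domain names are taken from the abstract; the per-domain item points
# used with these functions are illustrative, not the published rubric.
DOMAINS = [
    "clinical relevance",
    "data quality",
    "methodological conduct",
    "robustness of results",
    "reporting quality",
    "reproducibility",
]


def overall_score(item_points: dict[str, list[int]]) -> int:
    """Sum item points across all domains into an overall score (max 100
    in the real tool)."""
    return sum(sum(points) for points in item_points.values())


def quality_band(score: int) -> str:
    """Illustrative quality bands; these cut-offs are assumptions of this
    sketch, chosen only to be consistent with the abstract's examples
    (33 = low quality, 67 = high quality)."""
    if score >= 60:
        return "high"
    if score >= 40:
        return "moderate"
    return "low"


def spearman_rho(x: list[float], y: list[float]) -> float:
    """Spearman rank correlation (untied values only), the statistic used
    to compare APPRAISE-AI scores against expert scores."""
    def ranks(values: list[float]) -> list[float]:
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0.0] * len(values)
        for rank, i in enumerate(order, start=1):
            r[i] = float(rank)
        return r

    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx)
           * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den
```

For production analyses, a library implementation such as `scipy.stats.spearmanr` would also handle ties and report confidence intervals, as done in the study.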
Pages: 11