A Comparative Evaluation of Large Language Model Utility in Neuroimaging Clinical Decision Support

Cited by: 3
Authors
Miller, Luke [1]
Kamel, Peter [1]
Patel, Jigar [2]
Agrawal, Jay [1]
Zhan, Min [3]
Bumbarger, Nathan [1]
Wang, Kenneth [2]
Affiliations
[1] Univ Maryland, Dept Radiol, Med Ctr, Baltimore, MD 21201 USA
[2] Baltimore VA Med Ctr, Dept Radiol, Baltimore, MD USA
[3] Univ Maryland, Sch Med, Epidemiol & Publ Hlth, Baltimore, MD USA
Source
JOURNAL OF IMAGING INFORMATICS IN MEDICINE | 2025, Vol. 38, No. 4
Keywords
Imaging utilization; GPT4; ChatGPT; Bard; LLM; CT HEAD RULE; APPROPRIATENESS CRITERIA;
DOI
10.1007/s10278-024-01161-3
Chinese Library Classification
R8 [Special Medicine]; R445 [Diagnostic Imaging]
Subject Classification Codes
1002; 100207; 1009
Abstract
Imaging utilization has increased dramatically in recent years, and at least some of the studies ordered are not appropriate for the clinical scenario. The development of large language models (LLMs) may address this issue by providing a more accessible reference resource for ordering providers, but their relative performance is currently understudied. The objective of this study was to evaluate and compare the appropriateness and usefulness of imaging recommendations generated by eight publicly available LLMs in response to neuroradiology clinical scenarios. Twenty-four common neuroradiology clinical scenarios that often yield suboptimal imaging utilization were selected, and questions were crafted to assess the ability of LLMs to provide accurate and actionable advice. The LLMs were assessed in August 2023 using natural-language, 1-2-sentence queries requesting advice about optimal image ordering given certain clinical parameters. Eight of the most well-known LLMs were chosen for evaluation: ChatGPT, GPT-4, Bard (versions 1 and 2), Bing Chat, Llama 2, Perplexity, and Claude. The models were graded by three fellowship-trained neuroradiologists on whether their advice was "optimal" or "not optimal" according to the ACR Appropriateness Criteria or the New Orleans Head CT Criteria. The raters also ranked the models on the appropriateness, helpfulness, concision, and source citations in their responses. The models varied in their ability to deliver an "optimal" recommendation for these scenarios: ChatGPT (20/24), GPT-4 (23/24), Bard 1 (13/24), Bard 2 (14/24), Bing Chat (14/24), Llama 2 (5/24), Perplexity (19/24), and Claude (19/24). The median ranks of the LLMs were: ChatGPT (3), GPT-4 (1.5), Bard 1 (4.5), Bard 2 (5), Bing Chat (6), Llama 2 (7.5), Perplexity (4), and Claude (3). Characteristic errors are described and discussed. GPT-4, ChatGPT, and Claude generally outperformed Bard, Bing Chat, and Llama 2. This study evaluates the performance of a wider variety of publicly available LLMs in settings that more closely mimic real-world use cases and discusses the practical challenges of doing so. It is the first study to evaluate and compare such a broad range of publicly available LLMs on the appropriateness of their neuroradiology imaging recommendations.
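To make the reported summaries concrete, the short Python sketch below shows how the per-model figures could be derived from rater data: the count of scenarios (out of 24) judged "optimal" and the median of the per-scenario ranks. This is an illustrative sketch only, not the authors' code; the data layout, the hypothetical values, and the majority-vote rule for combining the three raters' labels are assumptions, since the abstract does not specify how the grades were aggregated.

```python
from statistics import median

# Illustrative sketch (not the authors' code) of the scoring summarized in the abstract.
# Assumptions: each model has, per scenario, three rater labels ("optimal" / "not optimal")
# and one rank relative to the other models; the three labels are combined by majority vote.

def optimal_count(rater_labels_per_scenario):
    """Count scenarios where a majority of the three raters judged the advice 'optimal'."""
    return sum(
        1
        for labels in rater_labels_per_scenario
        if sum(label == "optimal" for label in labels) >= 2
    )

def median_rank(ranks_per_scenario):
    """Median of the model's per-scenario ranks (1 = best)."""
    return median(ranks_per_scenario)

# Hypothetical data for a single model across 24 scenarios.
labels = [("optimal", "optimal", "not optimal")] * 20 + [("not optimal",) * 3] * 4
ranks = [1, 2, 3, 2, 1, 4, 3, 2, 1, 2, 3, 2, 1, 2, 3, 4, 2, 1, 3, 2, 2, 1, 3, 2]

print(f"Optimal recommendations: {optimal_count(labels)}/24")  # 20/24
print(f"Median rank: {median_rank(ranks)}")                    # 2.0
```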
Pages: 2294-2302
Page count: 9