A Comparative Study: Can Large Language Models Beat Radiologists on PI-RADSv2.1-Related Questions?

Cited: 0
|
Authors
Camur, Eren [1]
Cesur, Turay [2]
Gunes, Yasin Celal [3]
Affiliations
[1] 29 Mayis State Hosp, Minist Hlth Ankara, Dept Radiol, Cd 312, TR-06105 Ankara, Turkiye
[2] Ankara Mamak State Hosp, Dept Psychiat, TR-06270 Ankara, Turkiye
[3] TC Saglik Bakanligi Kirikkale Yuksek Ihtisas Hasta, Dept Radiol, Kirikkale, Turkiye
Keywords
Prostate; Large language models; Artificial intelligence; PI-RADS; Performance;
DOI
10.1007/s40846-024-00914-3
CLC number
R318 [Biomedical Engineering];
Discipline code
0831
Abstract
Purpose This study evaluates the accuracy of various large language models (LLMs) and compares them with radiologists in answering multiple-choice questions (MCQs) related to the Prostate Imaging-Reporting and Data System version 2.1 (PI-RADSv2.1). Methods In this cross-sectional study, one hundred MCQs covering all sections of PI-RADSv2.1 were prepared and posed to twelve different LLMs: Claude 3 Opus, Claude Sonnet, ChatGPT models (ChatGPT 4o, ChatGPT 4 Turbo, ChatGPT 4, ChatGPT 3.5), Google Gemini models (Gemini 1.5 Pro, Gemini 1.0), Microsoft Copilot, Perplexity, Meta Llama 3 70B, and Mistral Large. Two board-certified (EDiR) radiologists (radiologists 1 and 2) also answered the questions independently. Non-parametric tests were used for statistical analysis because the data were not normally distributed. Results Claude 3 Opus achieved the highest accuracy (85%) among the LLMs, followed by ChatGPT 4 Turbo (82%), ChatGPT 4o (80%), ChatGPT 4 (79%), Gemini 1.5 Pro (79%), and both radiologists (79% each). There was no significant difference in performance among Claude 3 Opus, the ChatGPT 4 models, Gemini 1.5 Pro, and the radiologists (p > 0.05). Conclusion That Claude 3 Opus outperformed all other LLMs, including the newest ChatGPT 4o, raises the question of whether it could be a new game changer among LLMs. The high accuracy of Claude 3 Opus, the ChatGPT 4 models, and Gemini 1.5 Pro, comparable to that of the radiologists, highlights their potential as clinical decision support tools and suggests a transformative impact on diagnostic accuracy and efficiency in radiology.
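The abstract reports non-parametric comparisons of responders' accuracy on the same 100 MCQs. The paper's exact test is not specified in this record; as an illustrative sketch only, paired binary outcomes like these are often compared with McNemar's exact test on the discordant answers (questions one responder got right and the other got wrong). The counts below are hypothetical, not taken from the study.

```python
from math import comb

def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts.

    b = questions responder A answered correctly but B missed;
    c = questions B answered correctly but A missed.
    Under H0 each discordant question is a fair coin flip, so this
    is a two-sided exact binomial test on Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant answers: no evidence of a difference
    k = min(b, c)
    # Two-sided tail mass: P(X <= k) + P(X >= n - k), X ~ Binomial(n, 0.5)
    p = sum(comb(n, i) for i in range(0, k + 1)) / 2**n
    p += sum(comb(n, i) for i in range(n - k, n + 1)) / 2**n
    return min(1.0, p)  # cap at 1 when the two tails overlap

# Hypothetical example: of 100 MCQs, 9 are discordant, split 6 vs 3.
p = mcnemar_exact_p(6, 3)  # well above 0.05, i.e. no significant difference
```

With small discordant counts like these, even a several-point gap in raw accuracy (e.g. 85% vs 79%) can fail to reach significance, which is consistent with the reported p > 0.05 among the top models and the radiologists.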
Pages: 821-830
Page count: 10
Related Papers
50 records in total
  • [1] A comparative analysis of large language models on clinical questions for autoimmune diseases
    Chen, Jing
    Ma, Juntao
    Yu, Jie
    Zhang, Weiming
    Zhu, Yijia
    Feng, Jiawei
    Geng, Linyu
    Dong, Xianchi
    Zhang, Huayong
    Chen, Yuxin
    Ning, Mingzhe
    FRONTIERS IN DIGITAL HEALTH, 2025, 7
  • [2] Can large language models reason about medical questions?
    Lievin, Valentin
    Hother, Christoffer Egeberg
    Motzfeldt, Andreas Geert
    Winther, Ole
    PATTERNS, 2024, 5 (03):
  • [3] Comparative Evaluation of the Accuracies of Large Language Models in Answering VI-RADS-Related Questions
    Camur, Eren
    Cesur, Turay
    Gunes, Yasin Celal
    KOREAN JOURNAL OF RADIOLOGY, 2024, 25 (08) : 767 - 768
  • [4] Evaluating the reliability of the responses of large language models to keratoconus-related questions
    Kayabasi, Mustafa
    Koksaldi, Seher
    Engin, Ceren Durmaz
    CLINICAL AND EXPERIMENTAL OPTOMETRY, 2024,
  • [5] Large language models in pathology: A comparative study of ChatGPT and Bard with pathology trainees on multiple-choice questions
    Du, Wei
    Jin, Xueting
    Harris, Jaryse Carol
    Brunetti, Alessandro
    Johnson, Erika
    Leung, Olivia
    Li, Xingchen
    Walle, Selemon
    Yu, Qing
    Zhou, Xiao
    Bian, Fang
    Mckenzie, Kajanna
    Kanathanavanich, Manita
    Ozcelik, Yusuf
    El-Sharkawy, Farah
    Koga, Shunsuke
    ANNALS OF DIAGNOSTIC PATHOLOGY, 2024, 73
  • [6] Comparative evaluation and performance of large language models on expert level critical care questions: a benchmark study
    Workum, Jessica D.
    Volkers, Bas W. S.
    van de Sande, Davy
    Arora, Sumesh
    Goeijenbier, Marco
    Gommers, Diederik
    van Genderen, Michel E.
    CRITICAL CARE, 2025, 29 (01)
  • [7] Comparative Analysis of the Accuracy of Large Language Models in Addressing Common Pulmonary Embolism Patient Questions
    Rosenzveig, Akiva
    Kassab, Joseph
    Sul, Lidiya
    Angelini, Dana
    Chaudhury, Pulkit
    Sarraju, Ashish
    Tefera, Leben
    JOURNAL OF THE AMERICAN HEART ASSOCIATION, 2024, 13 (21):
  • [8] Comparison of Large Language Models in Answering Immuno-Oncology Questions: A Cross-Sectional Study
    Iannantuono, Giovanni Maria
    Bracken-Clarke, Dara
    Karzai, Fatima
    Choo-Wosoba, Hyoyoung
    Gulley, James L.
    Floudas, Charalampos S.
    ONCOLOGIST, 2024, : 407 - 414
  • [9] Large Language Models Take on Cardiothoracic Surgery: A Comparative Analysis of the Performance of Four Models on American Board of Thoracic Surgery Exam Questions in 2023
    Khalpey, Zain
    Kumar, Ujjawal
    King, Nicholas
    Abraham, Alyssa
    Khalpey, Amina H.
    CUREUS JOURNAL OF MEDICAL SCIENCE, 2024, 16 (07)
  • [10] Large Language Models for Therapy Recommendations Across 3 Clinical Specialties: Comparative Study
    Wilhelm, Theresa Isabelle
    Roos, Jonas
    Kaczmarczyk, Robert
    JOURNAL OF MEDICAL INTERNET RESEARCH, 2023, 25