A Comparative Study: Can Large Language Models Beat Radiologists on PI-RADSv2.1-Related Questions?

Cited: 0
Authors
Eren, Camur [1 ]
Turay, Cesur [2 ]
Celal, Guenes Yasin [3 ]
Affiliations
[1] 29 Mayis State Hosp, Minist Hlth Ankara, Dept Radiol, Cd 312, TR-06105 Ankara, Turkiye
[2] Ankara Mamak State Hosp, Dept Psychiat, TR-06270 Ankara, Turkiye
[3] TC Saglik Bakanligi Kirikkale Yuksek Ihtisas Hasta, Dept Radiol, Kirikkale, Turkiye
Keywords
Prostate; Large language models; Artificial intelligence; PI-RADS; Performance;
DOI
10.1007/s40846-024-00914-3
Chinese Library Classification
R318 [Biomedical Engineering];
Subject Classification Code
0831;
Abstract
Purpose This study evaluates the accuracy of various large language models (LLMs) and compares them with radiologists in answering multiple-choice questions (MCQs) related to the Prostate Imaging-Reporting and Data System version 2.1 (PI-RADSv2.1). Methods In this cross-sectional study, one hundred MCQs covering all sections of PI-RADSv2.1 were prepared and posed to twelve different LLMs: Claude 3 Opus, Claude Sonnet, the ChatGPT models (ChatGPT 4o, ChatGPT 4 Turbo, ChatGPT 4, ChatGPT 3.5), the Google Gemini models (Gemini 1.5 Pro, Gemini 1.0), Microsoft Copilot, Perplexity, Meta Llama 3 70B, and Mistral Large. Two board-certified (EDiR) radiologists (radiologists 1 and 2) also answered the questions independently. Non-parametric tests were used for statistical analysis because of the non-normal distribution of the data. Results Claude 3 Opus achieved the highest accuracy rate (85%) among the LLMs, followed by ChatGPT 4 Turbo (82%), ChatGPT 4o (80%), ChatGPT 4 (79%), Gemini 1.5 Pro (79%), and both radiologists (79% each). There was no significant difference in performance among Claude 3 Opus, the ChatGPT 4 models, Gemini 1.5 Pro, and the radiologists (p > 0.05). Conclusion The finding that Claude 3 Opus outperformed all other LLMs (including the newest, ChatGPT 4o) raises the question of whether it could be a new game changer among LLMs. The high accuracy rates of Claude 3 Opus, the ChatGPT 4 models, and Gemini 1.5 Pro, comparable to those of the radiologists, highlight their potential as clinical decision support tools. This study underscores the potential of LLMs in radiology, suggesting a transformative impact on diagnostic accuracy and efficiency.
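The abstract states only that non-parametric tests were used to compare responders on the same 100 MCQs; it does not specify which tests. As an illustrative sketch (not the authors' actual analysis), paired accuracies on a shared question set can be compared with an exact McNemar test on the discordant answers; the counts below are hypothetical.

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Exact two-sided McNemar test on discordant pairs.

    b = questions responder A answered correctly but responder B missed;
    c = the reverse. Under H0 (equal accuracy), each discordant
    question favors A or B with probability 0.5.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    # two-sided exact binomial p-value: double the tail probability
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(p, 1.0)

# Hypothetical split of 100 MCQs: A correct where B wrong on 12,
# B correct where A wrong on 6 (remaining 82 answered the same way).
p = mcnemar_exact(12, 6)
print(round(p, 3))  # → 0.238, i.e. not significant at p > 0.05
```

With only 18 discordant questions, a 12-vs-6 split is not significant, which mirrors the paper's finding that the top LLMs and the radiologists did not differ significantly.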
Pages: 821-830
Page count: 10