A Comparative Study: Can Large Language Models Beat Radiologists on PI-RADSv2.1-Related Questions?

Times Cited: 0
Authors
Eren, Camur [1 ]
Turay, Cesur [2 ]
Celal, Guenes Yasin [3 ]
Affiliations
[1] 29 Mayis State Hosp, Minist Hlth Ankara, Dept Radiol, Cd 312, TR-06105 Ankara, Turkiye
[2] Ankara Mamak State Hosp, Dept Psychiat, TR-06270 Ankara, Turkiye
[3] TC Saglik Bakanligi Kirikkale Yuksek Ihtisas Hasta, Dept Radiol, Kirikkale, Turkiye
Keywords
Prostate; Large language models; Artificial intelligence; PI-RADS; Performance;
DOI
10.1007/s40846-024-00914-3
Chinese Library Classification (CLC): R318 [Biomedical Engineering]
Discipline Code: 0831
Abstract
Purpose: This study evaluates the accuracy of various large language models (LLMs) and compares them with radiologists in answering multiple-choice questions (MCQs) related to the Prostate Imaging-Reporting and Data System version 2.1 (PI-RADSv2.1).
Methods: In this cross-sectional study, one hundred MCQs covering all sections of PI-RADSv2.1 were prepared and posed to twelve different LLMs: Claude 3 Opus, Claude Sonnet, the ChatGPT models (ChatGPT 4o, ChatGPT 4 Turbo, ChatGPT 4, ChatGPT 3.5), the Google Gemini models (Gemini 1.5 Pro, Gemini 1.0), Microsoft Copilot, Perplexity, Meta Llama 3 70B, and Mistral Large. Two board-certified (EDiR) radiologists (radiologists 1 and 2) also answered the questions independently. Non-parametric tests were used for statistical analysis because the data were not normally distributed.
Results: Claude 3 Opus achieved the highest accuracy (85%) among the LLMs, followed by ChatGPT 4 Turbo (82%), ChatGPT 4o (80%), ChatGPT 4 (79%), Gemini 1.5 Pro (79%), and both radiologists (79% each). There was no significant difference in performance among Claude 3 Opus, the ChatGPT 4 models, Gemini 1.5 Pro, and the radiologists (p > 0.05).
Conclusion: The fact that Claude 3 Opus outperformed all other LLMs (including the newest, ChatGPT 4o) raises the question of whether it could be a new game changer among LLMs. The high accuracy rates of Claude 3 Opus, the ChatGPT 4 models, and Gemini 1.5 Pro, comparable to those of the radiologists, highlight their potential as clinical decision support tools. This study underscores the potential of LLMs in radiology, suggesting a transformative impact on diagnostic accuracy and efficiency.
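The abstract does not specify which non-parametric test was applied. As a minimal sketch only (not the authors' code), the following shows how paired per-question correctness from two responders answering the same MCQ set can be compared with McNemar's test, a standard non-parametric choice for paired binary data; the answer vectors and responder names are invented for illustration.

```python
# Sketch: comparing two responders' accuracy on the same MCQs with
# McNemar's test (paired, non-parametric). Data below are hypothetical.

def mcnemar_statistic(correct_a, correct_b):
    """Chi-square statistic of McNemar's test, no continuity correction.

    Only the discordant pairs matter: b = questions A got right and B got
    wrong, c = questions B got right and A got wrong.
    """
    b = sum(1 for a, r in zip(correct_a, correct_b) if a and not r)
    c = sum(1 for a, r in zip(correct_a, correct_b) if not a and r)
    if b + c == 0:
        return 0.0
    return (b - c) ** 2 / (b + c)

# Hypothetical correctness vectors over 10 questions (1 = answered correctly).
llm         = [1, 1, 1, 1, 1, 1, 0, 1, 1, 1]
radiologist = [1, 0, 1, 1, 1, 1, 1, 1, 0, 1]

accuracy_llm = sum(llm) / len(llm)                   # 0.9
accuracy_rad = sum(radiologist) / len(radiologist)   # 0.8
stat = mcnemar_statistic(llm, radiologist)
print(accuracy_llm, accuracy_rad, stat)
```

With a real 100-question dataset, the statistic would be compared against the chi-square distribution with one degree of freedom (e.g. via `scipy.stats.chi2.sf`) to obtain the p-value; a value below the chosen alpha would indicate a significant accuracy difference between the paired responders.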
Pages: 821-830
Page count: 10