A Comparative Study: Can Large Language Models Beat Radiologists on PI-RADSv2.1-Related Questions?

Cited: 0
Authors
Camur, Eren [1]
Cesur, Turay [2]
Gunes, Yasin Celal [3]
Affiliations
[1] 29 Mayis State Hosp, Minist Hlth Ankara, Dept Radiol, Cd 312, TR-06105 Ankara, Turkiye
[2] Ankara Mamak State Hosp, Dept Psychiat, TR-06270 Ankara, Turkiye
[3] TC Saglik Bakanligi Kirikkale Yuksek Ihtisas Hasta, Dept Radiol, Kirikkale, Turkiye
Keywords
Prostate; Large language models; Artificial intelligence; PI-RADS; Performance;
DOI
10.1007/s40846-024-00914-3
CLC classification
R318 [Biomedical Engineering];
Discipline code
0831;
Abstract
Purpose This study evaluates the accuracy of various large language models (LLMs) and compares them with radiologists in answering multiple-choice questions (MCQs) related to the Prostate Imaging-Reporting and Data System version 2.1 (PI-RADSv2.1). Methods In this cross-sectional study, one hundred MCQs covering all sections of PI-RADSv2.1 were prepared and posed to twelve different LLMs: Claude 3 Opus, Claude Sonnet, the ChatGPT models (ChatGPT 4o, ChatGPT 4 Turbo, ChatGPT 4, ChatGPT 3.5), the Google Gemini models (Gemini 1.5 Pro, Gemini 1.0), Microsoft Copilot, Perplexity, Meta Llama 3 70B, and Mistral Large. Two board-certified (EDiR) radiologists (radiologists 1 and 2) also answered the questions independently. Non-parametric tests were used for statistical analysis because the data were not normally distributed. Results Claude 3 Opus achieved the highest accuracy rate (85%) among the LLMs, followed by ChatGPT 4 Turbo (82%), ChatGPT 4o (80%), ChatGPT 4 (79%), Gemini 1.5 Pro (79%), and both radiologists (79% each). There was no significant difference in performance among Claude 3 Opus, the ChatGPT 4 models, Gemini 1.5 Pro, and the radiologists (p > 0.05). Conclusion That Claude 3 Opus outperformed all other LLMs (including the newest, ChatGPT 4o) raises the question of whether it could be a new game changer among LLMs. The high accuracy rates of Claude 3 Opus, the ChatGPT 4 models, and Gemini 1.5 Pro, comparable to those of the radiologists, highlight their potential as clinical decision support tools. This study underscores the potential of LLMs in radiology, suggesting a transformative impact on diagnostic accuracy and efficiency.
Pages: 821-830
Page count: 10
Related Papers
50 records
  • [41] Clinical Accuracy, Relevance, Clarity, and Emotional Sensitivity of Large Language Models to Surgical Patient Questions: Cross-Sectional Study
    Dagli, Mert Marcel
    Oettl, Felix Conrad
    Ujral, Jaskeerat
    Malhotra, Kashish
    Ghenbot, Yohannes
    Yoon, Jang W.
    Ozturk, Ali K.
    Welch, William C.
    JMIR FORMATIVE RESEARCH, 2024, 8
  • [42] Table Meets LLM: Can Large Language Models Understand Structured Table Data? A Benchmark and Empirical Study
    Sui, Yuan
    Zhou, Mengyu
    Zhou, Mingjie
    Han, Shi
    Zhang, Dongmei
    PROCEEDINGS OF THE 17TH ACM INTERNATIONAL CONFERENCE ON WEB SEARCH AND DATA MINING, WSDM 2024, 2024, : 645 - 654
  • [43] Enhancement of the Performance of Large Language Models in Diabetes Education through Retrieval-Augmented Generation: Comparative Study
    Wang, Dingqiao
    Liang, Jiangbo
    Ye, Jinguo
    Li, Jingni
    Li, Jingpeng
    Zhang, Qikai
    Hu, Qiuling
    Pan, Caineng
    Wang, Dongliang
    Liu, Zhong
    Shi, Wen
    Shi, Danli
    Li, Fei
    Qu, Bo
    Zheng, Yingfeng
    JOURNAL OF MEDICAL INTERNET RESEARCH, 2024, 26
  • [44] Can a large language model create acceptable dental board-style examination questions? A cross-sectional prospective study
    Kim, Hak-Sun
    Kim, Gyu-Tae
    JOURNAL OF DENTAL SCIENCES, 2025, 20 (02) : 895 - 900
  • [45] Investigating the Impact of Prompt Engineering on the Performance of Large Language Models for Standardizing Obstetric Diagnosis Text: Comparative Study
    Wang, Lei
    Bi, Wenshuai
    Zhao, Suling
    Ma, Yinyao
    Lv, Longting
    Meng, Chenwei
    Fu, Jingru
    Lv, Hanlin
    JMIR FORMATIVE RESEARCH, 2024, 8
  • [46] Accuracy of Large Language Models in Thyroid Nodule-Related Questions Based on the Korean Thyroid Imaging Reporting and Data System (K-TIRADS)
    Kaba, Esat
    Hursoy, Nur
    Solak, Merve
    Celiker, Fatma Beyazal
    KOREAN JOURNAL OF RADIOLOGY, 2024, 25 (05) : 499 - 500
  • [47] Comparative analysis of BERT-based and generative large language models for detecting suicidal ideation: a performance evaluation study
    de Oliveira, Adonias Caetano
    Bessa, Renato Freitas
    Soares, Ariel
    CADERNOS DE SAUDE PUBLICA, 2024, 40 (10):
  • [48] Evaluating text and visual diagnostic capabilities of large language models on questions related to the Breast Imaging Reporting and Data System Atlas 5th edition
    Gunes, Yasin Celal
    Cesur, Turay
    Camur, Eren
    Karabekmez, Leman Gunbey
    DIAGNOSTIC AND INTERVENTIONAL RADIOLOGY, 2025, 31 (02): : 111 - 129
  • [49] Harnessing Large Language Models for Structured Reporting in Breast Ultrasound: A Comparative Study of Open AI (GPT-4.0) and Microsoft Bing (GPT-4)
    Liu, ChaoXu
    Wei, MinYan
    Qin, Yu
    Zhang, MeiXiang
    Jiang, Huan
    Xu, JiaLe
    Zhang, YuNing
    Hua, Qing
    Hou, YiQing
    Dong, YiJie
    Xia, ShuJun
    Li, Ning
    Zhou, JianQiao
    ULTRASOUND IN MEDICINE AND BIOLOGY, 2024, 50 (11) : 1697 - 1703
  • [50] Evaluating the evidence-based potential of six large language models in paediatric dentistry: a comparative study on generative artificial intelligence
    Dermata, Anastasia
    Arhakis, Aristidis
    Makrygiannakis, Miltiadis A.
    Giannakopoulos, Kostis
    Kaklamanos, Eleftherios G.
    EUROPEAN ARCHIVES OF PAEDIATRIC DENTISTRY, 2025,