Generative pre-trained transformer 4o (GPT-4o) in solving text-based multiple response questions for European Diploma in Radiology (EDiR): a comparative study with radiologists

Cited by: 0
Authors
Jakub Pristoupil [1 ]
Laura Oleaga [2 ]
Vanesa Junquero [2 ]
Cristina Merino [2 ]
Ozbek Suha Sureyya [3 ]
Martin Kyncl [1 ]
Andrea Burgetova [4 ]
Lukas Lambert [1 ]
Affiliations
[1] Department of Imaging Methods, Motol University Hospital and Second Faculty of Medicine, Charles University, Prague
[2] Department of Radiology, Clinical Diagnostic Imaging Centre, Hospital Clínic de Barcelona, Barcelona
[3] Era Radiology Center, Izmir
[4] Department of Radiology, First Faculty of Medicine, Charles University and General University Hospital in Prague, Prague
Keywords
Artificial intelligence; Examination; Natural language processing; Radiology
DOI
10.1186/s13244-025-01941-7
Abstract
Objectives: This study aims to assess the accuracy of generative pre-trained transformer 4o (GPT-4o) in answering multiple-response questions from the European Diploma in Radiology (EDiR) examination, comparing its performance to that of human candidates. Materials and methods: In a prospective study (October 2024), results from 42 EDiR candidates across Europe were compared with those from 26 fourth-year medical students who answered exclusively using ChatGPT-4o. The challenge consisted of 52 recall- or understanding-based EDiR multiple-response questions, all without visual inputs. Results: GPT-4o achieved a mean score of 82.1 ± 3.0%, significantly outperforming the EDiR candidates at 49.4 ± 10.5% (p < 0.0001). In particular, GPT-4o demonstrated higher true-positive rates while maintaining lower false-positive rates than EDiR candidates, with a higher accuracy rate in all radiology subspecialties (p < 0.0001) except informatics (p = 0.20). There was near-perfect agreement among GPT-4o responses (κ = 0.872) and moderate agreement among EDiR participants (κ = 0.334). Exit surveys revealed that all participants used the copy-and-paste feature, and 73% submitted additional questions to clarify responses. Conclusions: GPT-4o significantly outperformed human candidates on low-order, text-based EDiR multiple-response questions, demonstrating higher accuracy and reliability. These results highlight GPT-4o's potential in answering text-based radiology questions. Further research is necessary to investigate its performance across different question formats and candidate populations to ensure broader applicability and reliability. Critical relevance statement: GPT-4o significantly outperforms human candidates on factual, text-based questions in the EDiR, excelling especially in identifying correct responses, with a higher accuracy rate than radiologists.
Key Points: On EDiR text-based questions, GPT-4o scored higher (82%) than EDiR participants (49%). Compared with radiologists, GPT-4o excelled in identifying correct responses. GPT-4o responses demonstrated higher agreement (κ = 0.87) than EDiR candidates (κ = 0.33). © The Author(s) 2025.