Evaluating GPT-4o's Performance in the Official European Board of Radiology Exam: A Comprehensive Assessment

Cited by: 6
Authors
Besler, Muhammed Said [1]
Oleaga, Laura [2]
Junquero, Vanesa [3]
Merino, Cristina [3]
Affiliations
[1] Kahramanmaras Necip Fazil City Hosp, Kahramanmaras Necip Fazil Sehir Hastanesi, Dept Radiol, TR-46050 Kahramanmaras, Turkiye
[2] Hosp Clin Barcelona, Dept Radiol, C de Villarroel 170, Barcelona 08036, Spain
[3] European Board Radiol, Ave Diagonal 383 Sobreat 1, Barcelona 08008, Spain
Keywords
Artificial intelligence; Large language model; GPT-4o; Radiology; Exam
DOI
10.1016/j.acra.2024.09.005
Chinese Library Classification number
R8 [Special medicine]; R445 [Diagnostic imaging]
Discipline classification code
1002; 100207; 1009
Abstract
Rationale and Objectives: This study aims to evaluate the performance of generative pre-trained transformer (GPT)-4o in the complete official European Board of Radiology (EBR) exam, designed to assess radiology knowledge, skills, and competence. Materials and Methods: Questions based on text, image, or video and in the format of multiple choice, free-text reporting, or image annotation were uploaded into GPT-4o using standardized prompting. The results were compared to the average scores of radiologists taking the exam in real time. Results: In Part 1 (multiple response questions and short cases), GPT-4o outperformed both the radiologists' average score and the pass score (70.2% vs. 58.4% and 60%, respectively). In Part 2 (clinically oriented reasoning evaluation), the performance of GPT-4o was below both the radiologists' average score and the pass score (52.9% vs. 66.1% and 55%, respectively). Accuracy on questions involving ultrasound images was higher than on other imaging modalities (accuracy rate, 87.5-100%). For video-based questions, the performance was 50.6%. The model achieved the highest accuracy on most likely diagnosis questions but showed lower accuracy in free-text reporting and direct anatomical assessment in images (100% vs. 31% and 28.6%, respectively). Conclusion: The abilities of GPT-4o in the official EBR exam are particularly noteworthy. This study demonstrates the potential of large language models to assist radiologists in assessing and managing cases from diagnosis to treatment or follow-up recommendations, even with zero-shot prompting.
Pages: 4365-4371
Number of pages: 7