Evaluation of the performance of GPT-3.5 and GPT-4 on the Polish Medical Final Examination

Cited by: 46
Authors
Rosol, Maciej [1]
Gasior, Jakub S. [2]
Laba, Jonasz [1]
Korzeniewski, Kacper [1]
Mlynczak, Marcel [1]
Affiliations
[1] Warsaw Univ Technol, Fac Mechatron, Inst Metrol & Biomed Engn, Boboli 8 St, PL-02525 Warsaw, Poland
[2] Med Univ Warsaw, Dept Pediat Cardiol & Gen Pediat, Warsaw, Poland
DOI
10.1038/s41598-023-46995-z
Chinese Library Classification (CLC)
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];
Subject classification codes
07; 0710; 09;
Abstract
The study aimed to evaluate the performance of two Large Language Models (LLMs), ChatGPT (based on GPT-3.5) and GPT-4, with two temperature parameter values on the Polish Medical Final Examination (MFE). The models were tested on three editions of the MFE (Spring 2022, Autumn 2022, and Spring 2023) in two language versions: English and Polish. The accuracies of the two models were compared, and the relationships between the correctness of answers and the answers' metrics were investigated. The study demonstrated that GPT-4 outperformed GPT-3.5 on all three examinations regardless of the language used. GPT-4 achieved a mean accuracy of 79.7% for both the Polish and English versions, passing all MFE editions. GPT-3.5 had mean accuracies of 54.8% for Polish and 60.3% for English, passing none of the three Polish versions at a temperature of 0 and two of three at a temperature of 1, while passing all English versions regardless of the temperature value. GPT-4's scores were mostly lower than the average score of a medical student. There was a statistically significant correlation between the correctness of the answers and the index of difficulty for both models. The overall accuracy of both models was still suboptimal and worse than the average for medical students, which emphasizes the need for further improvements in LLMs before they can be reliably deployed in medical settings. Nevertheless, these findings suggest a growing potential for the use of LLMs in medical education.
Pages: 13
Related Papers
50 items total
  • [31] A Comparison Between GPT-3.5, GPT-4, and GPT-4V: Can the Large Language Model (ChatGPT) Pass the Japanese Board of Orthopaedic Surgery Examination?
    Nakajima, Nozomu
    Fujimori, Takahito
    Furuya, Masayuki
    Kanie, Yuya
    Imai, Hirotatsu
    Kita, Kosuke
    Uemura, Keisuke
    Okada, Seiji
    CUREUS JOURNAL OF MEDICAL SCIENCE, 2024, 16 (03)
  • [32] Artificial Intelligence in Ophthalmology: A Comparative Analysis of GPT-3.5, GPT-4, and Human Expertise in Answering StatPearls Questions
    Moshirfar, Majid
    Altaf, Amal W.
    Stoakes, Isabella M.
    Tuttle, Jared J.
    Hoopes, Phillip C.
    CUREUS JOURNAL OF MEDICAL SCIENCE, 2023, 15 (06)
  • [33] Advancements in AI for Gastroenterology Education: An Assessment of OpenAI's GPT-4 and GPT-3.5 in MKSAP Question Interpretation
    Patel, Akash
    Samreen, Isha
    Ahmed, Imran
    AMERICAN JOURNAL OF GASTROENTEROLOGY, 2024, 119 (10S) : S1580 - S1580
  • [34] Comparing Vision-Capable Models, GPT-4 and Gemini, With GPT-3.5 on Taiwan's Pulmonologist Exam
    Chen, Chih-Hsiung
    Hsieh, Kuang-Yu
    Huang, Kuo-En
    Lai, Hsien-Yun
    CUREUS JOURNAL OF MEDICAL SCIENCE, 2024, 16 (08)
  • [35] Toward Improved Radiologic Diagnostics: Investigating the Utility and Limitations of GPT-3.5 Turbo and GPT-4 with Quiz Cases
    Kikuchi, Tomohiro
    Nakao, Takahiro
    Nakamura, Yuta
    Hanaoka, Shouhei
    Mori, Harushi
    Yoshikawa, Takeharu
    AMERICAN JOURNAL OF NEURORADIOLOGY, 2024, 45 (10) : 1506 - 1511
  • [36] Enhancing systematic reviews in orthodontics: a comparative examination of GPT-3.5 and GPT-4 for generating PICO-based queries with tailored prompts and configurations
    Demir, Gizem Boztas
    Sukut, Yagizalp
    Duran, Goekhan Serhat
    Topsakal, Kubra Gulnur
    Gorgulu, Serkan
    EUROPEAN JOURNAL OF ORTHODONTICS, 2024, 46 (02)
  • [37] ChatGPT as a Source of Information for Bariatric Surgery Patients: a Comparative Analysis of Accuracy and Comprehensiveness Between GPT-4 and GPT-3.5
    Samaan, Jamil S.
    Rajeev, Nithya
    Ng, Wee Han
    Srinivasan, Nitin
    Busam, Jonathan A.
    Yeo, Yee Hui
    Samakar, Kamran
    OBESITY SURGERY, 2024, 34 : 1987 - 1989
  • [38] The performance of ChatGPT on orthopaedic in-service training exams: A comparative study of the GPT-3.5 turbo and GPT-4 models in orthopaedic education
    Rizzo, Michael G.
    Cai, Nathan
    Constantinescu, David
    JOURNAL OF ORTHOPAEDICS, 2024, 50 : 70 - 75
  • [39] Performance of GPT-4 on Chinese Nursing Examination
    Miao, Yiqun
    Luo, Yuan
    Zhao, Yuhan
    Li, Jiawei
    Liu, Mingxuan
    Wang, Huiying
    Chen, Yuling
    Wu, Ying
    NURSE EDUCATOR, 2024, 49 (06) : E338 - E343
  • [40] Evaluating Large Language Models for the National Premedical Exam in India: Comparative Analysis of GPT-3.5, GPT-4, and Bard
    Farhat, Faiza
    Chaudhry, Beenish Moalla
    Nadeem, Mohammad
    Sohail, Shahab Saquib
    Madsen, Dag Oivind
    JMIR MEDICAL EDUCATION, 2024, 10