Evaluation of the performance of GPT-3.5 and GPT-4 on the Polish Medical Final Examination

Cited by: 46

Authors:

Rosol, Maciej [1]

Gasior, Jakub S. [2]

Laba, Jonasz [1]

Korzeniewski, Kacper [1]

Mlynczak, Marcel [1]

Affiliations:

[1] Warsaw Univ Technol, Fac Mechatron, Inst Metrol & Biomed Engn, Boboli 8 St, PL-02525 Warsaw, Poland

[2] Med Univ Warsaw, Dept Pediat Cardiol & Gen Pediat, Warsaw, Poland

Source:

SCIENTIFIC REPORTS | 2023 / Volume 13 / Issue 01

DOI:

10.1038/s41598-023-46995-z

Chinese Library Classification (CLC):

O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];

Discipline classification codes:

07 ; 0710 ; 09 ;

Abstract:

The study aimed to evaluate the performance of two Large Language Models (LLMs), ChatGPT (based on GPT-3.5) and GPT-4, with two temperature parameter values, on the Polish Medical Final Examination (MFE). The models were tested on three editions of the MFE (Spring 2022, Autumn 2022, and Spring 2023) in two language versions, English and Polish. The accuracies of both models were compared, and the relationships between the correctness of the answers and the answers' metrics were investigated. The study demonstrated that GPT-4 outperformed GPT-3.5 in all three examinations regardless of the language used. GPT-4 achieved mean accuracies of 79.7% for both the Polish and English versions, passing all MFE versions. GPT-3.5 had mean accuracies of 54.8% for Polish and 60.3% for English; it passed none of the three Polish versions at a temperature of 0 and 2 of 3 at a temperature of 1, while passing all English versions regardless of the temperature value. The GPT-4 score was mostly lower than the average score of a medical student. There was a statistically significant correlation between the correctness of the answers and the index of difficulty for both models. The overall accuracy of both models was still suboptimal and worse than the average for medical students, which emphasizes the need for further improvements in LLMs before they can be reliably deployed in medical settings. These findings nonetheless suggest an increasing potential for the use of LLMs in medical education.
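
The two quantities the abstract reports, accuracy and the correlation between answer correctness and the index of difficulty, can be illustrated with a minimal sketch. The abstract does not say which correlation statistic the authors used; the point-biserial (Pearson) coefficient below is an assumption. The function names and the toy data are hypothetical and are not taken from the study.

```python
from statistics import mean, pstdev


def accuracy(correct):
    """Fraction of questions answered correctly; `correct` is a list of 0/1."""
    return sum(correct) / len(correct)


def point_biserial(correct, values):
    """Pearson correlation between a 0/1 outcome and a numeric metric.

    Assumption: the study's exact statistic is not given in the abstract,
    so this is only one plausible choice.
    """
    n = len(correct)
    mean_c, mean_v = mean(correct), mean(values)
    # Population covariance between correctness and the metric.
    cov = sum((c - mean_c) * (v - mean_v) for c, v in zip(correct, values)) / n
    return cov / (pstdev(correct) * pstdev(values))


# Hypothetical toy data: 1 = model answered correctly, 0 = incorrectly,
# paired with an illustrative per-question index of difficulty.
correct = [1, 1, 0, 1, 0, 0, 1, 1]
difficulty_index = [0.9, 0.8, 0.3, 0.7, 0.4, 0.2, 0.6, 0.95]

print(f"accuracy = {accuracy(correct):.3f}")
print(f"correlation = {point_biserial(correct, difficulty_index):.3f}")
```

Significance testing, which the abstract also mentions, is left out of this sketch.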

Pages: 13

