Evaluation of the performance of GPT-3.5 and GPT-4 on the Polish Medical Final Examination

Cited by: 46
Authors:
Rosoł, Maciej [1]
Gąsior, Jakub S. [2]
Łaba, Jonasz [1]
Korzeniewski, Kacper [1]
Młyńczak, Marcel [1]
Affiliations:
[1] Warsaw Univ Technol, Fac Mechatron, Inst Metrol & Biomed Engn, Boboli 8 St, PL-02525 Warsaw, Poland
[2] Med Univ Warsaw, Dept Pediat Cardiol & Gen Pediat, Warsaw, Poland
DOI: 10.1038/s41598-023-46995-z
Chinese Library Classification: O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences]
Subject classification codes: 07; 0710; 09
Abstract
The study aimed to evaluate the performance of two Large Language Models (LLMs), ChatGPT (based on GPT-3.5) and GPT-4, each with two temperature parameter values, on the Polish Medical Final Examination (MFE). The models were tested on three editions of the MFE (Spring 2022, Autumn 2022, and Spring 2023) in two language versions, English and Polish. The accuracies of both models were compared, and the relationships between the correctness of answers and the answers' metrics were investigated. GPT-4 outperformed GPT-3.5 on all three examinations regardless of the language used. GPT-4 achieved mean accuracies of 79.7% for both the Polish and English versions, passing all MFE editions. GPT-3.5 had mean accuracies of 54.8% for Polish and 60.3% for English, passing none of the Polish editions at temperature 0 and two of three at temperature 1, while passing all English editions regardless of the temperature value. However, the GPT-4 score was mostly lower than the average score of a medical student. For both models there was a statistically significant correlation between the correctness of the answers and the index of difficulty. The overall accuracy of both models was still suboptimal and worse than the average for medical students, which emphasizes the need for further improvements before LLMs can be reliably deployed in medical settings. Nevertheless, these findings suggest a growing potential for the use of LLMs in medical education.
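As an illustration of the kind of analysis the abstract describes, the following is a minimal Python sketch (not the authors' code) of scoring a model's multiple-choice answers against the exam key and correlating correctness with each item's index of difficulty. The question records, option letters, and difficulty values are hypothetical placeholders, not data from the study.

```python
# Minimal sketch: accuracy plus the correctness-vs-difficulty association
# described in the abstract. All data below are hypothetical.
from scipy.stats import pointbiserialr

# Hypothetical per-question records: the model's chosen option, the exam key,
# and the item's index of difficulty reported for human examinees.
questions = [
    {"model_answer": "A", "correct_answer": "A", "difficulty_index": 0.92},
    {"model_answer": "C", "correct_answer": "B", "difficulty_index": 0.41},
    {"model_answer": "D", "correct_answer": "D", "difficulty_index": 0.77},
    {"model_answer": "B", "correct_answer": "B", "difficulty_index": 0.63},
]

# Binary correctness per question (1 = correct, 0 = incorrect).
correct = [int(q["model_answer"] == q["correct_answer"]) for q in questions]
difficulty = [q["difficulty_index"] for q in questions]

# Overall accuracy, the metric compared between GPT-3.5 and GPT-4 in the study.
accuracy = sum(correct) / len(correct)

# Point-biserial correlation between correctness (dichotomous) and the
# continuous difficulty index; one way to test the association reported
# in the abstract.
r, p_value = pointbiserialr(correct, difficulty)

print(f"accuracy = {accuracy:.1%}, r = {r:.2f}, p = {p_value:.3f}")
```

With real data, one such record set would be built per model, temperature setting, language version, and MFE edition before comparing accuracies.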
Pages: 13