Evaluation of the performance of GPT-3.5 and GPT-4 on the Polish Medical Final Examination

Cited by: 46
Authors:
Rosoł, Maciej [1]
Gąsior, Jakub S. [2]
Łaba, Jonasz [1]
Korzeniewski, Kacper [1]
Młyńczak, Marcel [1]
Affiliations:
[1] Warsaw Univ Technol, Fac Mechatron, Inst Metrol & Biomed Engn, Boboli 8 St, PL-02525 Warsaw, Poland
[2] Med Univ Warsaw, Dept Pediat Cardiol & Gen Pediat, Warsaw, Poland
DOI: 10.1038/s41598-023-46995-z
Chinese Library Classification: O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences]
Subject classification codes: 07; 0710; 09
Abstract
The study aimed to evaluate the performance of two Large Language Models (LLMs), ChatGPT (based on GPT-3.5) and GPT-4, each with two temperature parameter values, on the Polish Medical Final Examination (MFE). The models were tested on three editions of the MFE (Spring 2022, Autumn 2022, and Spring 2023) in two language versions, English and Polish. The accuracies of both models were compared, and the relationships between the correctness of answers and the answers' metrics were investigated. GPT-4 outperformed GPT-3.5 on all three examinations regardless of the language used. GPT-4 achieved mean accuracies of 79.7% for both the Polish and English versions, passing all MFE editions. GPT-3.5 had mean accuracies of 54.8% for Polish and 60.3% for English, passing none of the Polish editions at temperature 0 and two of three at temperature 1, while passing all English editions regardless of the temperature value. However, the GPT-4 score was mostly lower than the average score of a medical student. For both models there was a statistically significant correlation between the correctness of the answers and the index of difficulty. The overall accuracy of both models was still suboptimal and worse than the average for medical students, which emphasizes the need for further improvements before LLMs can be reliably deployed in medical settings. Nevertheless, these findings suggest a growing potential for the use of LLMs in medical education.
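As an illustration of the kind of analysis the abstract describes, the following is a minimal Python sketch (not the authors' code) of scoring a model's multiple-choice answers against the exam key and correlating correctness with each item's index of difficulty. The question records, option letters, and difficulty values are hypothetical placeholders, not data from the study.

```python
# Minimal sketch: accuracy plus the correctness-vs-difficulty association
# described in the abstract. All data below are hypothetical.
from scipy.stats import pointbiserialr

# Hypothetical per-question records: the model's chosen option, the exam key,
# and the item's index of difficulty reported for human examinees.
questions = [
    {"model_answer": "A", "correct_answer": "A", "difficulty_index": 0.92},
    {"model_answer": "C", "correct_answer": "B", "difficulty_index": 0.41},
    {"model_answer": "D", "correct_answer": "D", "difficulty_index": 0.77},
    {"model_answer": "B", "correct_answer": "B", "difficulty_index": 0.63},
]

# Binary correctness per question (1 = correct, 0 = incorrect).
correct = [int(q["model_answer"] == q["correct_answer"]) for q in questions]
difficulty = [q["difficulty_index"] for q in questions]

# Overall accuracy, the metric compared between GPT-3.5 and GPT-4 in the study.
accuracy = sum(correct) / len(correct)

# Point-biserial correlation between correctness (dichotomous) and the
# continuous difficulty index; one way to test the association reported
# in the abstract.
r, p_value = pointbiserialr(correct, difficulty)

print(f"accuracy = {accuracy:.1%}, r = {r:.2f}, p = {p_value:.3f}")
```

With real data, one such record set would be built per model, temperature setting, language version, and MFE edition before comparing accuracies.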
Pages: 13