Evaluation of the performance of GPT-3.5 and GPT-4 on the Polish Medical Final Examination

Cited by: 46
Authors
Rosol, Maciej [1]
Gasior, Jakub S. [2]
Laba, Jonasz [1]
Korzeniewski, Kacper [1]
Mlynczak, Marcel [1]
Affiliations
[1] Warsaw Univ Technol, Fac Mechatron, Inst Metrol & Biomed Engn, Boboli 8 St, PL-02525 Warsaw, Poland
[2] Med Univ Warsaw, Dept Pediat Cardiol & Gen Pediat, Warsaw, Poland
DOI
10.1038/s41598-023-46995-z
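Chinese Library Classification (CLC)
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification codes
07; 0710; 09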
Abstract
The study aimed to evaluate the performance of two Large Language Models (LLMs), ChatGPT (based on GPT-3.5) and GPT-4, using two temperature parameter values, on the Polish Medical Final Examination (MFE). The models were tested on three editions of the MFE (Spring 2022, Autumn 2022, and Spring 2023) in two language versions, English and Polish. The accuracies of both models were compared, and the relationships between the correctness of the answers and the answers' metrics were investigated. The study demonstrated that GPT-4 outperformed GPT-3.5 in all three examinations regardless of the language used. GPT-4 achieved a mean accuracy of 79.7% for both the Polish and English versions, passing all MFE versions. GPT-3.5 had mean accuracies of 54.8% for Polish and 60.3% for English; on the Polish versions it passed none of the three at temperature 0 and two of the three at temperature 1, while it passed all English versions regardless of the temperature value. GPT-4's scores were mostly lower than the average score of a medical student. There was a statistically significant correlation between the correctness of the answers and the index of difficulty for both models. The overall accuracy of both models was still suboptimal and worse than the average for medical students, which emphasizes the need for further improvements in LLMs before they can be reliably deployed in medical settings. These findings suggest an increasing potential for the use of LLMs in medical education.
Pages: 13