ChatGPT Performs Worse on USMLE-Style Ethics Questions Compared to Medical Knowledge Questions

Cited by: 0
Authors
Danehy, Tessa [1]
Hecht, Jessica [1]
Kentis, Sabrina [1]
Schechter, Clyde B. [2]
Jariwala, Sunit P. [3]
Affiliations
[1] Albert Einstein Coll Med, Montefiore Med Ctr, Bronx, NY 10461 USA
[2] Albert Einstein Coll Med, Dept Family & Social Med, Bronx, NY USA
[3] Albert Einstein Coll Med, Div Allergy Immunol, Montefiore Med Ctr, Bronx, NY USA
Source
APPLIED CLINICAL INFORMATICS, 2024, Vol. 15, No. 5
Keywords
ChatGPT; large language model; artificial intelligence; medical education; USMLE; ethics
DOI
10.1055/a-2405-0138
Chinese Library Classification: R-058
Abstract
Objectives: The main objective of this study is to evaluate the ability of the large language model Chat Generative Pre-Trained Transformer (ChatGPT) to accurately answer United States Medical Licensing Examination (USMLE) board-style medical ethics questions compared to medical knowledge-based questions. The study additionally compares the overall accuracy of GPT-3.5 to GPT-4 and assesses the variability of the responses given by each version.
Methods: Using AMBOSS, a third-party USMLE Step exam test-prep service, we selected one group of 27 medical ethics questions and a second group of 27 medical knowledge questions matched on question difficulty for medical students. We ran 30 trials asking these questions of GPT-3.5 and GPT-4 and recorded the output. A random-effects linear probability regression model evaluated accuracy, and a Shannon entropy calculation evaluated response variation.
Results: Both versions of ChatGPT performed worse on medical ethics questions than on medical knowledge questions. GPT-4 scored 18 percentage points lower (p < 0.05) on medical ethics questions than on medical knowledge questions, and GPT-3.5 scored 7 percentage points lower (p = 0.41). GPT-4 outperformed GPT-3.5 by 22 percentage points (p < 0.001) on medical ethics and 33 percentage points (p < 0.001) on medical knowledge. GPT-4 also exhibited lower overall Shannon entropy for medical ethics and medical knowledge questions (0.21 and 0.11, respectively) than GPT-3.5 (0.59 and 0.55, respectively), indicating lower variability in responses.
Conclusion: Both versions of ChatGPT performed more poorly on medical ethics questions than on medical knowledge questions. GPT-4 significantly outperformed GPT-3.5 in overall accuracy and exhibited significantly lower response variability in answer choices. This underscores the need for ongoing assessment of ChatGPT versions for medical education.
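The abstract names two analysis techniques: a Shannon entropy calculation to quantify response variability across repeated trials, and a random-effects linear probability regression to compare accuracy. The Python sketch below shows one plausible way the entropy measure could be computed, assuming entropy is taken per question over the distribution of answer letters returned across the 30 trials (base-2 logarithm) and then averaged over the question set; these details, and the example data, are assumptions for illustration rather than the authors' actual code.

import math
from collections import Counter

def shannon_entropy(responses):
    # Shannon entropy (base 2) of the answer-choice distribution for one question.
    # `responses` is the list of answer letters returned across repeated trials,
    # e.g. ["A", "A", "C", "A", ...]; higher entropy means more variable answers.
    counts = Counter(responses)
    n = len(responses)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def mean_entropy(per_question_responses):
    # Average entropy over a list of per-question response lists.
    return sum(shannon_entropy(r) for r in per_question_responses) / len(per_question_responses)

# Hypothetical example: one question answered 30 times, mostly consistently.
trials = ["A"] * 27 + ["C"] * 3
print(round(shannon_entropy(trials), 2))  # low value -> consistent answers

For the accuracy comparison, a random-effects linear probability model can be fit with statsmodels by regressing a 0/1 correctness indicator on question-type and model-version indicators with a random intercept per question. The exact specification used in the paper is not given in the abstract, so the formula, variable names, and simulated data below are illustrative assumptions.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
rows = []
for q in range(54):                      # 27 ethics + 27 knowledge questions
    ethics = int(q < 27)                 # hypothetical indicator coding
    for gpt4 in (0, 1):                  # 0 = GPT-3.5, 1 = GPT-4
        p = 0.55 + 0.25 * gpt4 - 0.10 * ethics * gpt4   # simulated accuracy
        for _ in range(30):              # 30 trials per question and model
            rows.append({"question_id": q, "ethics": ethics, "gpt4": gpt4,
                         "correct": int(rng.random() < p)})
df = pd.DataFrame(rows)

# Linear probability model with a random intercept for each question
# to account for repeated trials of the same item.
fit = smf.mixedlm("correct ~ ethics * gpt4", data=df, groups=df["question_id"]).fit()
print(fit.summary())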
Pages: 1049-1055 (7 pages)