ChatGPT Performs Worse on USMLE-Style Ethics Questions Compared to Medical Knowledge Questions

Cited by: 0
Authors
Danehy, Tessa [1 ]
Hecht, Jessica [1 ]
Kentis, Sabrina [1 ]
Schechter, Clyde B. [2 ]
Jariwala, Sunit P. [3 ]
Affiliations
[1] Albert Einstein Coll Med, Montefiore Med Ctr, Bronx, NY 10461 USA
[2] Albert Einstein Coll Med, Dept Family & Social Med, Bronx, NY USA
[3] Albert Einstein Coll Med, Div Allergy Immunol, Montefiore Med Ctr, Bronx, NY USA
Source
APPLIED CLINICAL INFORMATICS | 2024, Vol. 15, No. 5
Keywords
ChatGPT; large language model; artificial intelligence; medical education; USMLE; ethics
DOI
10.1055/a-2405-0138
Chinese Library Classification (CLC): R-058
Abstract
Objectives: The main objective of this study is to evaluate the ability of the large language model Chat Generative Pre-Trained Transformer (ChatGPT) to accurately answer United States Medical Licensing Examination (USMLE) board-style medical ethics questions compared with medical knowledge questions. Additional objectives are to compare the overall accuracy of GPT-3.5 and GPT-4 and to assess the variability of the responses given by each version.
Methods: Using AMBOSS, a third-party USMLE Step exam test-prep service, we selected one group of 27 medical ethics questions and a second group of 27 medical knowledge questions matched on question difficulty for medical students. We ran 30 trials of these questions on GPT-3.5 and GPT-4 and recorded the output. A random-effects linear probability regression model evaluated accuracy, and a Shannon entropy calculation evaluated response variation.
Results: Both versions of ChatGPT performed worse on medical ethics questions than on medical knowledge questions. GPT-4 scored 18 percentage points lower (p < 0.05) on medical ethics questions than on medical knowledge questions, and GPT-3.5 scored 7 percentage points lower (p = 0.41). GPT-4 outperformed GPT-3.5 by 22 percentage points (p < 0.001) on medical ethics and by 33 percentage points (p < 0.001) on medical knowledge. GPT-4 also exhibited lower overall Shannon entropy for medical ethics and medical knowledge questions (0.21 and 0.11, respectively) than GPT-3.5 (0.59 and 0.55, respectively), indicating lower response variability.
Conclusion: Both versions of ChatGPT performed more poorly on medical ethics questions than on medical knowledge questions. GPT-4 significantly outperformed GPT-3.5 in overall accuracy and exhibited significantly lower variability in answer choices. These findings underscore the need for ongoing assessment of ChatGPT versions in medical education.
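The abstract does not give the entropy formula, so the following is a minimal illustrative sketch, not the authors' code: it assumes response variability is measured as the Shannon entropy of the answer-choice distribution for each question across the 30 repeated trials, averaged over questions. All function names, variable names, and example data below are hypothetical.

```python
from collections import Counter
from math import log2

def shannon_entropy(choices):
    """Shannon entropy (in bits) of the answer choices a model gave
    to a single question across repeated trials."""
    counts = Counter(choices)
    total = sum(counts.values())
    return -sum((n / total) * log2(n / total) for n in counts.values())

def mean_entropy(trials_by_question):
    """Average per-question entropy; a rough proxy for the kind of
    overall response-variability figure reported in the abstract."""
    per_question = [shannon_entropy(c) for c in trials_by_question.values()]
    return sum(per_question) / len(per_question)

# Hypothetical example: recorded answers for two questions, 30 trials each.
example = {
    "ethics_q1": ["A"] * 25 + ["C"] * 5,   # mostly consistent -> low entropy
    "ethics_q2": ["B"] * 15 + ["D"] * 15,  # evenly split -> entropy of 1 bit
}
print(round(mean_entropy(example), 2))
```

Under this reading, an entropy of 0 for a question would mean the model chose the same answer on every trial, while answers spread evenly across four options would give log2(4) = 2 bits; lower averages therefore correspond to more consistent answering.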
Pages: 1049-1055 (7 pages)