ChatGPT Performs Worse on USMLE-Style Ethics Questions Compared to Medical Knowledge Questions

Cited by: 0
Authors
Danehy, Tessa [1 ]
Hecht, Jessica [1 ]
Kentis, Sabrina [1 ]
Schechter, Clyde B. [2 ]
Jariwala, Sunit P. [3 ]
Affiliations
[1] Albert Einstein Coll Med, Montefiore Med Ctr, Bronx, NY 10461 USA
[2] Albert Einstein Coll Med, Dept Family & Social Med, Bronx, NY USA
[3] Albert Einstein Coll Med, Div Allergy Immunol, Montefiore Med Ctr, Bronx, NY USA
Source
APPLIED CLINICAL INFORMATICS | 2024, Vol. 15, Issue 05
Keywords
ChatGPT; large language model; artificial intelligence; medical education; USMLE; ethics
DOI
10.1055/a-2405-0138
Chinese Library Classification (CLC): R-058
Abstract
Objectives: The main objective of this study is to evaluate the ability of the large language model Chat Generative Pre-Trained Transformer (ChatGPT) to accurately answer United States Medical Licensing Examination (USMLE) board-style medical ethics questions compared with medical knowledge-based questions. Additional objectives are to compare the overall accuracy of GPT-3.5 and GPT-4 and to assess the variability of the responses given by each version.

Methods: Using AMBOSS, a third-party USMLE Step exam test-preparation service, we selected one group of 27 medical ethics questions and a second group of 27 medical knowledge questions matched on question difficulty for medical students. We ran 30 trials of these questions on GPT-3.5 and GPT-4 and recorded the output. A random-effects linear probability regression model evaluated accuracy, and a Shannon entropy calculation evaluated response variation.

Results: Both versions of ChatGPT performed worse on medical ethics questions than on medical knowledge questions. GPT-4 scored 18 percentage points (p < 0.05) lower on medical ethics questions than on medical knowledge questions, and GPT-3.5 scored 7 percentage points (p = 0.41) lower. GPT-4 outperformed GPT-3.5 by 22 percentage points (p < 0.001) on medical ethics and by 33 percentage points (p < 0.001) on medical knowledge. GPT-4 also exhibited lower overall Shannon entropy than GPT-3.5 for medical ethics and medical knowledge questions (0.21 and 0.11 vs. 0.59 and 0.55, respectively), indicating lower response variability.

Conclusion: Both versions of ChatGPT performed more poorly on medical ethics questions than on medical knowledge questions. GPT-4 significantly outperformed GPT-3.5 in overall accuracy and exhibited significantly lower variability in its answer choices. These findings underscore the need for ongoing assessment of ChatGPT versions for medical education.
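The Methods quantify response variability as the Shannon entropy of the answer choices a model gives when the same question is asked across 30 trials. A minimal illustrative sketch of that calculation is shown below; it is not the authors' code, and the function name and example trial data are hypothetical.

```python
# Illustrative sketch (assumption: not the authors' actual code).
# Computes the Shannon entropy (in bits) of the distribution of answer
# choices recorded when one question is posed repeatedly, e.g., 30 trials.
from collections import Counter
from math import log2

def shannon_entropy(choices):
    """Entropy of the empirical distribution of answer choices for one question."""
    counts = Counter(choices)
    total = len(choices)
    return -sum((n / total) * log2(n / total) for n in counts.values())

# Hypothetical example: a question answered "B" in 28 of 30 trials and "C" in 2.
trials = ["B"] * 28 + ["C"] * 2
print(round(shannon_entropy(trials), 2))  # ~0.35 bits; lower entropy = more consistent answers
```

An entropy of 0 would mean the model gave the same answer in every trial, so the lower average values reported for GPT-4 (0.21 and 0.11) indicate more consistent responses than those of GPT-3.5 (0.59 and 0.55).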
Pages: 1049-1055
Page count: 7