ChatGPT Performs Worse on USMLE-Style Ethics Questions Compared to Medical Knowledge Questions

Cited by: 0
Authors
Danehy, Tessa [1]
Hecht, Jessica [1]
Kentis, Sabrina [1]
Schechter, Clyde B. [2]
Jariwala, Sunit P. [3]
Affiliations
[1] Albert Einstein Coll Med, Montefiore Med Ctr, Bronx, NY 10461 USA
[2] Albert Einstein Coll Med, Dept Family & Social Med, Bronx, NY USA
[3] Albert Einstein Coll Med, Div Allergy Immunol, Montefiore Med Ctr, Bronx, NY USA
Source
APPLIED CLINICAL INFORMATICS | 2024, Vol. 15, No. 05
Keywords
ChatGPT; large language model; artificial intelligence; medical education; USMLE; ethics
DOI
10.1055/a-2405-0138
Chinese Library Classification (CLC)
R-058
Abstract
Objectives: The main objective of this study is to evaluate the ability of the large language model Chat Generative Pre-Trained Transformer (ChatGPT) to accurately answer United States Medical Licensing Examination (USMLE) board-style medical ethics questions compared to medical knowledge-based questions. Additional objectives are to compare the overall accuracy of GPT-3.5 and GPT-4 and to assess the variability of the responses given by each version.
Methods: Using AMBOSS, a third-party USMLE Step exam test-prep service, we selected one group of 27 medical ethics questions and a second group of 27 medical knowledge questions matched on question difficulty for medical students. We ran 30 trials posing these questions to GPT-3.5 and GPT-4 and recorded the output. A random-effects linear probability regression model evaluated accuracy, and a Shannon entropy calculation evaluated response variation.
Results: Both versions of ChatGPT performed worse on medical ethics questions than on medical knowledge questions. GPT-4 scored 18 percentage points lower (p < 0.05) on medical ethics questions than on medical knowledge questions, and GPT-3.5 scored 7 percentage points lower (p = 0.41). GPT-4 outperformed GPT-3.5 by 22 percentage points (p < 0.001) on medical ethics and by 33 percentage points (p < 0.001) on medical knowledge. GPT-4 also exhibited lower overall Shannon entropy for medical ethics and medical knowledge questions (0.21 and 0.11, respectively) than GPT-3.5 (0.59 and 0.55, respectively), indicating lower variability in responses.
Conclusion: Both versions of ChatGPT performed more poorly on medical ethics questions than on medical knowledge questions. GPT-4 significantly outperformed GPT-3.5 in overall accuracy and exhibited significantly lower response variability in answer choices. This underscores the need for ongoing assessment of ChatGPT versions for medical education.
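Illustrative note on the variability metric: the sketch below is a minimal Python example, not the authors' code, of how Shannon entropy can quantify how consistently a model selects the same answer choice when the same question is posed repeatedly (here, 30 trials per question). The natural-log base, the per-question computation, and the toy answer counts are assumptions for illustration; the paper reports a single entropy value per model and question category.

```python
# Minimal sketch (assumed, not the authors' implementation): Shannon entropy of
# the empirical distribution of answer choices a model gives across repeated
# trials of one question. Lower entropy = more consistent answering.
from collections import Counter
from math import log


def shannon_entropy(answers: list[str]) -> float:
    """Entropy (natural log, an assumed base) of the answer-choice distribution."""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * log(c / n) for c in counts.values())


# Hypothetical data: one question asked for 30 trials of each model version.
trials_gpt4 = ["C"] * 28 + ["B"] * 2                # mostly consistent answers
trials_gpt35 = ["C"] * 18 + ["B"] * 8 + ["A"] * 4   # more varied answers

print(round(shannon_entropy(trials_gpt4), 2))   # ~0.24 (low variability)
print(round(shannon_entropy(trials_gpt35), 2))  # ~0.93 (higher variability)
```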
Pages: 1049-1055
Number of pages: 7