Pure Wisdom or Potemkin Villages? A Comparison of ChatGPT 3.5 and ChatGPT 4 on USMLE Step 3 Style Questions: Quantitative Analysis

Times cited: 18
Authors
Knoedler, Leonard [1 ,10 ]
Alfertshofer, Michael [2 ]
Knoedler, Samuel [1 ,3 ]
Hoch, Cosima C. [4 ]
Funk, Paul F. [5 ]
Cotofana, Sebastian [6 ,7 ]
Maheta, Bhagvat [8 ]
Frank, Konstantin [9 ]
Brebant, Vanessa [1 ]
Prantl, Lukas [1 ]
Lamby, Philipp [1 ]
Affiliations
[1] Univ Hosp Regensburg, Dept Plast Hand & Reconstruct Surg, Regensburg, Germany
[2] Ludwig Maximilians Univ Munchen, Div Hand Plast & Aesthet Surg, Munich, Germany
[3] Harvard Med Sch, Brigham & Womens Hosp, Div Plast Surg, Boston, MA 02115 USA
[4] Tech Univ Munich TUM, Sch Med, Dept Otolaryngol Head & Neck Surg, Munich, Germany
[5] Friedrich Schiller Univ Jena, Univ Hosp Jena, Dept Otolaryngol Head & Neck Surg, Jena, Germany
[6] Erasmus Univ Hosp, Dept Dermatol, Rotterdam, Netherlands
[7] Queen Mary Univ London, Blizard Inst, Ctr Cutaneous Res, London, England
[8] Calif Northstate Univ, Coll Med, Elk Grove, CA USA
[9] Ocean Clin, Marbella, Spain
[10] Univ Hosp Regensburg, Dept Plast Hand & Reconstruct Surg, Franz Josef Str Allee 11, D-93053 Regensburg, Germany
Source
JMIR MEDICAL EDUCATION | 2024, Vol. 10
Keywords
ChatGPT; United States Medical Licensing Examination; artificial intelligence; USMLE; USMLE Step 1; OpenAI; medical education; clinical decision-making
DOI
10.2196/51148
Chinese Library Classification (CLC)
G40 [Education]
Discipline Classification Codes
040101; 120403
Abstract
Background: The United States Medical Licensing Examination (USMLE) has been a cornerstone of medical education since 1992, testing a medical student's knowledge and skills through successive steps matched to their level of training. Artificial intelligence (AI) tools, including chatbots such as ChatGPT, are emerging technologies with potential applications in medicine. However, comprehensive studies analyzing ChatGPT's performance on USMLE Step 3 at scale, and comparing different versions of ChatGPT, remain limited.

Objective: This paper aimed to analyze ChatGPT's performance on USMLE Step 3 practice test questions to better elucidate the strengths and weaknesses of AI use in medical education and to deduce evidence-based strategies for counteracting AI cheating.

Methods: A total of 2069 USMLE Step 3 practice questions were extracted from the AMBOSS study platform. After excluding 229 image-based questions, the remaining 1840 text-based questions were categorized and entered into ChatGPT 3.5, and a subset of 229 questions was also entered into ChatGPT 4. Responses were recorded, and the accuracy of the answers, as well as performance across test question categories and difficulty levels, was compared between the two versions.

Results: Overall, ChatGPT 4 showed a statistically significant superior performance compared with ChatGPT 3.5, with accuracies of 84.7% (194/229) and 56.9% (1047/1840), respectively. A weak but statistically significant negative correlation was observed between question length and the performance of ChatGPT 3.5 (rho=-0.069; P=.003); no such correlation was found for ChatGPT 4 (P=.87). Question difficulty, as categorized by AMBOSS hammer ratings, correlated significantly with performance for both versions (ChatGPT 3.5: rho=-0.289; ChatGPT 4: rho=-0.344). ChatGPT 4 surpassed ChatGPT 3.5 at every difficulty level except the 2 highest tiers (4 and 5 hammers), where the difference did not reach statistical significance.

Conclusions: In this study, ChatGPT 4 demonstrated remarkable proficiency on USMLE Step 3 questions, with an accuracy of 84.7% (194/229), outperforming ChatGPT 3.5 at 56.9% (1047/1840). Despite its strong overall performance, ChatGPT 4 struggled with questions requiring the application of theoretical concepts, particularly in cardiology and neurology. These insights are pivotal for developing AI-resilient examination strategies and underline the promising role of AI in medical education and diagnostics.
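For readers who want to reproduce the style of analysis reported above, the following is a minimal Python sketch, assuming scipy and numpy are available: a chi-square test on the accuracy difference between the two ChatGPT versions, and a Spearman rank correlation of answer correctness against question difficulty. The contingency counts are taken from the abstract; the per-question difficulty and correctness arrays are hypothetical placeholders, since the study's item-level data are not part of this record.

# Minimal sketch of the abstract's two headline analyses (assumed
# approach; the authors' exact statistical pipeline is not given here).
import numpy as np
from scipy.stats import chi2_contingency, spearmanr

# Accuracy comparison from the abstract:
# ChatGPT 4: 194/229 correct; ChatGPT 3.5: 1047/1840 correct.
table = np.array([
    [194, 229 - 194],     # ChatGPT 4: correct, incorrect
    [1047, 1840 - 1047],  # ChatGPT 3.5: correct, incorrect
])
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.1f}, dof={dof}, P={p:.3g}")  # P << .001

# Correlation of correctness with AMBOSS difficulty (1-5 "hammers").
# Hypothetical placeholder data, not the study's item-level results.
rng = np.random.default_rng(42)
difficulty = rng.integers(1, 6, size=1840)                 # hammer rating per question
correct = (rng.random(1840) > difficulty / 7).astype(int)  # 1 = answered correctly
rho, p_rho = spearmanr(difficulty, correct)
print(f"rho={rho:.3f}, P={p_rho:.3g}")

Run on the abstract's counts, the chi-square test reproduces a highly significant gap between the two versions; with the study's real per-question data, the spearmanr call would yield the negative difficulty correlations reported in the Results.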
Pages: 10
References (18 total)
[1] AMBOSS question bank. URL: amboss.com/
[2] Baumgartner C. The opportunities and pitfalls of ChatGPT in clinical and translational medicine. Clinical and Translational Medicine, 2023, 13(3).
[3] Burk-Rafel J, Santen SA, Purkiss J. Study Behaviors and USMLE Step 1 Performance: Implications of a Student Self-Directed Parallel Curriculum. Academic Medicine, 2017, 92(11): S67-S74.
[4] Cangialosi PT, Chung BC, Thielhelm TP, Camarda ND, Eiger DS. Medical Students' Reflections on the Recent Changes to the USMLE Step Exams. Academic Medicine, 2021, 96(3): 343-348.
[5] Chartier C, Gfrerer L, Knoedler L, Austen WG Jr. Artificial Intelligence-Enabled Evaluation of Pain Sketches to Predict Outcomes in Headache Surgery. Plastic and Reconstructive Surgery, 2023, 151(2): 405-411.
[6] Gauer JL, Jackson JB. The association of USMLE Step 1 and Step 2 CK scores with residency match specialty and location. Medical Education Online, 2017, 22.
[7] Gilson A. JMIR Medical Education, 2023, 9: e45312. DOI: 10.2196/45312.
[8] Hoch CC, Wollenberg B, Lueers JC, Knoedler S, Knoedler L, Frank K, Cotofana S, Alfertshofer M. ChatGPT's quiz skills in different otolaryngology subspecialties: an analysis of 2576 single-choice and multiple-choice board certification preparation questions. European Archives of Oto-Rhino-Laryngology, 2023, 280(9): 4271-4278.
[9] Hopkins BS, Nguyen VN, Dallas J, Texakalidis P, Yang M, Renn A, Guerra G, Kashif Z, Cheok S, Zada G, Mack WJ. ChatGPT versus the neurosurgical written boards: a comparative analysis of artificial intelligence/machine learning performance on neurosurgical board-style questions. Journal of Neurosurgery, 2023, 139(3): 904-911.
[10] Knoedler L, Odenthal J, Prantl L, Oezdemir B, Kehrer A, Kauke-Navarro M, Matar DY, Obed D, Panayi AC, Broer PN, Chartier C, Knoedler S. Artificial intelligence-enabled simulation of gluteal augmentation: A helpful tool in preoperative outcome simulation? Journal of Plastic Reconstructive and Aesthetic Surgery, 2023, 80: 94-101.