Analyzing Question Characteristics Influencing ChatGPT's Performance in 3000 USMLE®-Style Questions

Cited by: 0
Authors
Alfertshofer, Michael [1 ,11 ]
Knoedler, Samuel [2 ,3 ]
Hoch, Cosima C. [4 ]
Cotofana, Sebastian [5 ,6 ]
Panayi, Adriana C. [2 ,7 ]
Kauke-Navarro, Martin [3 ]
Tullius, Stefan G. [8 ]
Orgill, Dennis P. [2 ]
Austen, William G. [9 ]
Pomahac, Bohdan [3 ]
Knoedler, Leonard [3 ,9 ,10 ]
Affiliations
[1] Ludwig Maximilians Univ Munchen, Dept Oral & Maxillofacial Surg, Munich, Germany
[2] Harvard Med Sch, Brigham & Womens Hosp, Div Plast Surg, Boston, MA USA
[3] Yale New Haven Hosp, Yale Sch Med, Div Plast Surg, New Haven, CT USA
[4] Tech Univ Munich, Dept Otolaryngol Head & Neck Surg, Munich, Germany
[5] Erasmus Hosp, Dept Dermatol, Rotterdam, Netherlands
[6] Queen Mary Univ London, Blizard Inst, Ctr Cutaneous Res, London, England
[7] Heidelberg Univ, Burn Trauma Ctr, BG Trauma Ctr Ludwigshafen, Dept Hand Plast & Reconstruct Surg Microsurg, Ludwigshafen, Germany
[8] Harvard Med Sch, Brigham & Womens Hosp, Div Transplant Surg, Boston, MA USA
[9] Harvard Med Sch, Massachusetts Gen Hosp, Div Plast & Reconstruct Surg, Boston, MA USA
[10] Univ Hosp Regensburg, Dept Plast Hand & Reconstruct Surg, Regensburg, Germany
[11] Tech Univ Munich, Dept Plast Surg & Hand Surg, Klinikum Rechts Isar, Munich, Germany
Keywords
Medical education; Artificial intelligence; ChatGPT; USMLE; Quiz
DOI
10.1007/s40670-024-02176-9
Chinese Library Classification
G40 [Pedagogy];
Subject Classification Code
040101; 120403;
Abstract
Background
The potential of artificial intelligence (AI) and large language models such as ChatGPT in medical applications is promising, yet their performance requires comprehensive evaluation. This study assessed ChatGPT's ability to answer USMLE® Step 2 CK questions, analyzing its performance across medical specialties, question types, and difficulty levels in a large-scale question set, in order to help question writers develop AI-resistant exam questions and to give medical students a realistic understanding of how AI can enhance their active learning.
Materials and Methods
A total of n = 3302 USMLE® Step 2 CK practice questions were extracted from the AMBOSS© study platform; 302 image-based questions were excluded, leaving 3000 text-based questions for analysis. The questions were manually entered into ChatGPT, and its accuracy across categories and difficulty levels was evaluated.
Results
ChatGPT answered 57.7% of all questions correctly. Performance was highest in the category "Male Reproductive System" (71.7%) and lowest in the category "Immune System" (46.3%). Performance was lower on table-based questions, and question difficulty correlated negatively with performance (r_s = -0.285, p < 0.001). Longer questions were answered incorrectly more often (r_s = -0.076, p < 0.001), with a significant difference in length between correctly and incorrectly answered questions.
Conclusion
ChatGPT demonstrated proficiency close to the passing threshold for USMLE® Step 2 CK. Performance varied by category, question type, and difficulty. These findings can help medical educators make their exams more AI-proof and inform the integration of AI tools such as ChatGPT into teaching strategies. For students, understanding the model's limitations and capabilities ensures that it is used as an auxiliary resource to foster active learning rather than misused as a replacement for studying. This study highlights the need for further refinement and improvement of AI models for medical education and decision-making.
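The statistical analysis described above lends itself to a short illustration. Below is a minimal Python sketch (not the authors' code; the column names and toy data are illustrative assumptions) of how per-category accuracy and the reported Spearman rank correlations between correctness, difficulty, and question length could be computed:

    # Minimal sketch of the analysis the abstract describes:
    # per-category accuracy and Spearman correlations of correctness
    # with question difficulty and question length.
    # Column names and values are hypothetical, not from the study.
    import pandas as pd
    from scipy.stats import spearmanr

    # Hypothetical results table: one row per question.
    df = pd.DataFrame({
        "category":   ["Immune System", "Male Reproductive System",
                       "Immune System", "Male Reproductive System"],
        "difficulty": [4, 2, 5, 1],        # e.g., an ordinal 1-5 rating
        "length":     [220, 140, 310, 120],  # question length (words)
        "correct":    [0, 1, 0, 1],        # 1 if ChatGPT answered correctly
    })

    # Accuracy by medical specialty/category.
    print(df.groupby("category")["correct"].mean())

    # Spearman's rank correlation: difficulty vs. correctness
    # (the study reports r_s = -0.285, p < 0.001 on 3000 questions).
    r_s, p = spearmanr(df["difficulty"], df["correct"])
    print(f"difficulty: r_s = {r_s:.3f}, p = {p:.3g}")

    # Same for question length vs. correctness (reported r_s = -0.076).
    r_len, p_len = spearmanr(df["length"], df["correct"])
    print(f"length:     r_s = {r_len:.3f}, p = {p_len:.3g}")

Spearman's correlation is a natural fit here because the difficulty ratings are ordinal and correctness is binary, so only rank information, not interval scaling, is assumed.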
Pages: 257-267
Number of pages: 11