Evaluating ChatGPT Performance on the Orthopaedic In-Training Examination

Cited by: 74
Authors
Kung, Justin E. [1 ,2 ]
Marshall, Christopher [1 ,3 ]
Gauthier, Chase [1 ,2 ]
Gonzalez, Tyler A. [1 ,2 ]
Jackson III, J. Benjamin [1 ,2 ]
Affiliations
[1] Prisma Health Midlands-University of South Carolina School of Medicine, Columbia, SC, USA
[2] Prisma Health Midlands-University of South Carolina, Department of Orthopaedic Surgery, Columbia, SC 29201, USA
[3] University of South Carolina School of Medicine, Columbia, SC, USA
DOI
10.2106/JBJS.OA.23.00056
Chinese Library Classification (CLC)
R826.8 [Plastic surgery]; R782.2 [Oral and maxillofacial plastic surgery]; R726.2 [Pediatric plastic surgery]; R62 [Plastic surgery (reconstructive surgery)];
Abstract
Background: Artificial intelligence (AI) holds potential for improving medical education and healthcare delivery. ChatGPT is a state-of-the-art natural language processing AI model that has shown impressive capabilities, scoring in the top percentiles on numerous standardized examinations, including the Uniform Bar Exam and the Scholastic Aptitude Test. The goal of this study was to evaluate ChatGPT's performance on the Orthopaedic In-Training Examination (OITE), an assessment of medical knowledge for orthopaedic residents.

Methods: OITE 2020, 2021, and 2022 questions without images were input into ChatGPT version 3.5 and version 4 (GPT-4) with zero prompting. Performance was evaluated as the percentage of correct responses and compared with the national average of orthopaedic surgery residents at each postgraduate year (PGY) level. ChatGPT was asked to provide a source for each answer; the source was categorized as a journal article, book, or website, and recorded as verifiable or not. The impact factor of each cited journal was also recorded.

Results: ChatGPT answered 196 of 360 questions correctly (54.3%), corresponding to a PGY-1 level. ChatGPT cited a verifiable source in 47.2% of questions, with an average journal impact factor of 5.4. GPT-4 answered 265 of 360 questions correctly (73.6%), corresponding to the average performance of a PGY-5 and exceeding the 67% score that corresponds to passing the American Board of Orthopaedic Surgery Part I Examination. GPT-4 cited a verifiable source in 87.9% of questions, with an average journal impact factor of 5.2.

Conclusions: ChatGPT performed above the average PGY-1 level and GPT-4 performed above the average PGY-5 level, a marked improvement between versions. Further investigation is needed to determine how successive versions of ChatGPT will perform and how to optimize this technology to improve medical education.

Clinical Relevance: AI has the potential to aid in medical education and healthcare delivery.
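The evaluation described in the Methods reduces to zero-shot querying of a chat model followed by a simple proportion of correct responses. The sketch below is a minimal illustration of that idea, assuming the OpenAI Python SDK (v1.x), an OPENAI_API_KEY in the environment, hypothetical question data, and a naive answer-letter extraction rule; the study itself entered questions into the ChatGPT interface with zero prompting, so this is an approximation rather than the authors' exact procedure.

# Minimal sketch (not the authors' workflow): zero-shot querying of an OpenAI
# chat model with text-only OITE-style questions and scoring percent correct.
# Assumes the OpenAI Python SDK v1.x; question data and the answer-extraction
# rule below are hypothetical placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical data: each item holds the full question text (stem plus lettered
# options) and the correct option letter.
questions = [
    {"text": "Question stem ...\nA. ...\nB. ...\nC. ...\nD. ...", "answer": "B"},
]

def ask_zero_shot(question_text: str, model: str = "gpt-4") -> str:
    """Send the raw question with no additional prompting; return the reply text."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question_text}],
    )
    return response.choices[0].message.content

def extract_choice(reply: str) -> str:
    """Naive extraction: first standalone option letter A-E found in the reply."""
    for token in reply.replace(".", " ").replace(")", " ").split():
        if token.upper() in {"A", "B", "C", "D", "E"}:
            return token.upper()
    return ""

correct = sum(extract_choice(ask_zero_shot(q["text"])) == q["answer"] for q in questions)
print(f"Correct: {correct}/{len(questions)} ({100 * correct / len(questions):.1f}%)")

The same loop, run once per model, yields the percentages compared against resident PGY-level norms in the Results; source verification and impact-factor lookup were performed manually in the study and are not modeled here.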
Pages: 5