Comparison of ChatGPT-3.5, ChatGPT-4, and Orthopaedic Resident Performance on Orthopaedic Assessment Examinations

Cited by: 93
Authors
Massey, Patrick A. [1]
Montgomery, Carver [1]
Zhang, Andrew S. [1]
Affiliations
[1] Louisiana State University, Department of Orthopaedic Surgery, Health Sciences Center Shreveport, Shreveport, LA 71103, USA
DOI: 10.5435/JAAOS-D-23-00396
Chinese Library Classification (CLC)
R826.8 [plastic surgery]; R782.2 [oral and maxillofacial plastic surgery]; R726.2 [pediatric plastic surgery]; R62 [plastic surgery (reconstructive surgery)];
Abstract
Introduction: Artificial intelligence (AI) programs can answer complex queries, including medical profession examination questions. The purpose of this study was to compare the performance of orthopaedic residents (ortho residents) against Chat Generative Pretrained Transformer (ChatGPT)-3.5 and GPT-4 on orthopaedic assessment examinations. A secondary objective was to perform a subgroup analysis comparing the performance of each group on questions that included image interpretation versus text-only questions.

Methods: The ResStudy orthopaedic examination question bank was used as the primary source of questions. One hundred eighty questions and their answer choices from nine orthopaedic subspecialties were input directly into ChatGPT-3.5 and then GPT-4. Because ChatGPT did not have consistently available image interpretation, no images were provided to either AI format. Each chatbot answer was recorded as correct or incorrect, and resident performance was recorded from user data provided by ResStudy.

Results: Overall, ChatGPT-3.5, GPT-4, and ortho residents scored 29.4%, 47.2%, and 74.2%, respectively. There was a difference among the three groups in testing success, with ortho residents scoring higher than ChatGPT-3.5 and GPT-4 (P < 0.001 and P < 0.001). GPT-4 scored higher than ChatGPT-3.5 (P = 0.002). A subgroup analysis was performed by dividing questions into question stems without images and question stems with images. ChatGPT-3.5 was more often correct on text-only questions than on questions with images (37.8% vs. 22.4%, OR = 2.1, P = 0.033), as was GPT-4 (61.0% vs. 35.7%, OR = 2.8, P < 0.001). Residents scored 72.6% on text-only questions versus 75.5% on questions with images, with no significant difference (P = 0.302).

Conclusion: Orthopaedic residents answered more questions accurately than ChatGPT-3.5 and GPT-4 on orthopaedic assessment examinations. GPT-4 is superior to ChatGPT-3.5 for answering orthopaedic resident assessment examination questions. Both ChatGPT-3.5 and GPT-4 performed better on text-only questions than on questions with images. It is unlikely that GPT-4 or ChatGPT-3.5 would pass the American Board of Orthopaedic Surgery written examination.
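The subgroup comparisons in the Results reduce to 2 x 2 contingency tables (correct vs. incorrect, text-only vs. image-based questions). The sketch below shows how such an odds ratio and P value could be computed with SciPy; the cell counts are hypothetical, chosen only to be consistent with the percentages and odds ratios reported in the abstract, since the exact per-group question counts are not given here.

# Illustrative only: hypothetical 2x2 tables consistent with the reported
# percentages; the actual per-group counts are not stated in the abstract.
from scipy.stats import fisher_exact

# rows: text-only questions, questions with images
# cols: correct, incorrect
tables = {
    "ChatGPT-3.5": [[31, 51],   # ~37.8% correct on text-only questions
                    [22, 76]],  # ~22.4% correct on image-based questions
    "GPT-4":       [[50, 32],   # ~61.0% correct on text-only questions
                    [35, 63]],  # ~35.7% correct on image-based questions
}

for model, table in tables.items():
    (a, b), (c, d) = table
    manual_or = (a * d) / (b * c)  # odds ratio computed by hand
    odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
    print(f"{model}: OR = {odds_ratio:.2f} (manual {manual_or:.2f}), P = {p_value:.3f}")

With these hypothetical counts the odds ratios come out near the reported 2.1 and 2.8; the published analysis may have used a different test (e.g., chi-square), so the P values are only indicative.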
Pages: 1173-1179
Number of pages: 7