ChatGPT performance on the American Shoulder and Elbow Surgeons maintenance of certification exam

Cited by: 6
Authors
Fiedler, Benjamin [1]
Azua, Eric N. [1]
Phillips, Todd [1]
Ahmed, Adil Shahzad [1]
Institution
[1] Baylor College of Medicine, Joseph Barnhart Department of Orthopedic Surgery, Houston, TX, USA
Keywords
ChatGPT; artificial intelligence (AI); maintenance of certification (MOC); machine learning; shoulder; elbow; American Shoulder and Elbow Surgeons (ASES)
DOI
10.1016/j.jse.2024.02.029
Chinese Library Classification (CLC)
R826.8 [plastic surgery]; R782.2 [oral and maxillofacial plastic surgery]; R726.2 [pediatric plastic surgery]; R62 [plastic surgery (reconstructive surgery)];
Discipline classification code
Abstract
Background: While multiple studies have tested the ability of large language models (LLMs), such as ChatGPT, to pass standardized medical exams at different levels of training, LLMs have never been tested on surgical subspecialty examinations, such as the American Shoulder and Elbow Surgeons (ASES) Maintenance of Certification (MOC). The purpose of this study was to compare the results of ChatGPT 3.5, GPT-4, and fellowship-trained surgeons on the 2023 ASES MOC self-assessment exam. Methods: ChatGPT 3.5 and GPT-4 were subjected to the same set of text-only questions from the ASES MOC exam, and GPT-4 was additionally subjected to image-based MOC exam questions. Responses from both models were compared against the correct answers, and the performance of each model was compared to the corresponding average human performance on the same question subsets. One-sided proportional z-tests were used to analyze the data. Results: Humans performed significantly better than ChatGPT 3.5 on exclusively text-based questions (76.4% vs. 60.8%, P = .044). Humans also performed significantly better than GPT-4 on image-based questions (73.9% vs. 53.2%, P = .019). There was no significant difference between humans and GPT-4 on text-based questions (76.4% vs. 66.7%, P = .136). Across all questions, humans significantly outperformed GPT-4 (75.3% vs. 60.2%, P = .012). GPT-4 did not perform significantly better than ChatGPT 3.5 on text-only questions (66.7% vs. 60.8%, P = .268). Discussion: Although human performance was superior overall, ChatGPT demonstrated the capacity to analyze orthopedic information and answer specialty-specific questions on the ASES MOC exam for both text- and image-based questions. With continued advances in deep learning, LLMs may someday rival the exam performance of fellowship-trained surgeons. (c) 2024 Journal of Shoulder and Elbow Surgery Board of Trustees. All rights reserved.
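The one-sided proportion z-test used in the Methods (commonly called a two-proportion z-test) can be sketched as follows. The question counts below are hypothetical for illustration only; the record reports percentages but not the exact denominators used in the study.

```python
import math

def one_sided_two_proportion_z(x1, n1, x2, n2):
    """One-sided two-proportion z-test (H1: p1 > p2) using a pooled variance."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)                     # pooled success proportion
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Upper-tail p-value from the standard normal CDF, via the complementary
    # error function: P(Z > z) = 0.5 * erfc(z / sqrt(2))
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value

# Hypothetical example: 38/50 correct (76%) vs. 30/50 correct (60%)
z, p = one_sided_two_proportion_z(38, 50, 30, 50)
```

With these made-up counts the test yields z ≈ 1.71 and a one-sided p-value just under .05, the same kind of marginal significance reported for the human-vs.-ChatGPT 3.5 comparison.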
Pages: 1888-1893
Page count: 6