Performance of ChatGPT and GPT-4 on Neurosurgery Written Board Examinations

Cited by: 113

Authors:

Ali, Rohaid ^{[1,6]}

Tang, Oliver Y. ^{[1]}

Connolly, Ian D. ^{[2]}

Sullivan, Patricia L. Zadnik ^{[1]}

Shin, John H. ^{[3]}

Fridley, Jared S. ^{[1]}

Asaad, Wael F. ^{[1,3,4,5]}

Cielo, Deus ^{[1]}

Oyelese, Adetokunbo A. ^{[1]}

Doberstein, Curtis E. ^{[1]}

Gokaslan, Ziya L. ^{[1]}

Telfeian, Albert E. ^{[1]}

Affiliations:

[1] USA, Blountstown, FL USA

[2] Massachusetts Gen Hosp, Dept Neurosurg, Boston, MA USA

[3] Rhode Isl Hosp, Norman Prince Neurosci Inst, Dept Neurosci, Providence, RI 02903 USA

[4] Brown Univ, Dept Neurosci, Providence, RI USA

[5] Brown Univ, Carney Inst Brain Sci, Dept Neurosci, Providence, RI USA

[6] Rhode Isl Hosp, Dept Neurosurg, LPG Neurosurg, 593 Eddy St,APC6, Providence, RI 02903 USA

Source:

NEUROSURGERY | 2023 / Vol. 93 / Issue 06

Keywords:

Neurosurgery; Medical education; Surgical education; Residency education; Artificial intelligence; Large language models; ChatGPT; GPT-4;

DOI:

10.1227/neu.0000000000002632

Chinese Library Classification number:

R74 [Neurology and Psychiatry];

Subject classification number:

Abstract:

BACKGROUND AND OBJECTIVES: Interest surrounding generative large language models (LLMs) has rapidly grown. Although ChatGPT (GPT-3.5), a general LLM, has shown near-passing performance on medical student board examinations, the performance of ChatGPT or its successor GPT-4 on specialized examinations and the factors affecting accuracy remain unclear. This study aims to assess the performance of ChatGPT and GPT-4 on a 500-question mock neurosurgical written board examination.

METHODS: The Self-Assessment Neurosurgery Examinations (SANS) American Board of Neurological Surgery Self-Assessment Examination 1 was used to evaluate ChatGPT and GPT-4. Questions were in single best answer, multiple-choice format. χ², Fisher exact, and univariable logistic regression tests were used to assess performance differences in relation to question characteristics.

RESULTS: ChatGPT (GPT-3.5) and GPT-4 achieved scores of 73.4% (95% CI: 69.3%-77.2%) and 83.4% (95% CI: 79.8%-86.5%), respectively, relative to the user average of 72.8% (95% CI: 68.6%-76.6%). Both LLMs exceeded last year's passing threshold of 69%. Although scores between ChatGPT and question bank users were equivalent (P = .963), GPT-4 outperformed both (both P < .001). GPT-4 answered every question answered correctly by ChatGPT and 37.6% (50/133) of remaining incorrect questions correctly. Among 12 question categories, GPT-4 significantly outperformed users in each but performed comparably with ChatGPT in 3 (functional, other general, and spine) and outperformed both users and ChatGPT for tumor questions. Increased word count (odds ratio = 0.89 of answering a question correctly per +10 words) and higher-order problem-solving (odds ratio = 0.40, P = .009) were associated with lower accuracy for ChatGPT, but not for GPT-4 (both P > .005). Multimodal input was not available at the time of this study; hence, on questions with image content, ChatGPT and GPT-4 answered 49.5% and 56.8% of questions correctly based on contextual clues alone.

CONCLUSION: LLMs achieved passing scores on a mock 500-question neurosurgical written board examination, with GPT-4 significantly outperforming ChatGPT.
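The headline scores and their confidence intervals can be roughly reproduced from the figures in the abstract. The sketch below is illustrative only: the correct-answer counts (367 and 417 out of 500) are inferred from the reported percentages, and the abstract does not state which interval method was used, so the Wilson score interval here is an assumption. The function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (assumed method)."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


TOTAL_QUESTIONS = 500

# Counts inferred from the reported scores: 73.4% and 83.4% of 500 questions.
inferred_correct = {"ChatGPT (GPT-3.5)": 367, "GPT-4": 417}

for model, correct in inferred_correct.items():
    low, high = wilson_interval(correct, TOTAL_QUESTIONS)
    score = correct / TOTAL_QUESTIONS
    print(f"{model}: {score:.1%} (95% CI: {low:.1%}-{high:.1%})")

# Consistency check on the abstract's "50/133": GPT-4's extra correct answers
# over ChatGPT's, divided by the questions ChatGPT answered incorrectly.
extra = inferred_correct["GPT-4"] - inferred_correct["ChatGPT (GPT-3.5)"]
missed = TOTAL_QUESTIONS - inferred_correct["ChatGPT (GPT-3.5)"]
print(f"Recovered by GPT-4: {extra}/{missed} = {extra / missed:.1%}")
```

Under these assumptions the intervals land within a few tenths of a percentage point of the reported ones; the small gap suggests the authors may have used a different method, such as an exact binomial interval.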


Pages: 1353-1365

Page count: 13
