Performance of ChatGPT and GPT-4 on Neurosurgery Written Board Examinations

Cited by: 113
Authors
Ali, Rohaid [1,6]
Tang, Oliver Y. [1 ]
Connolly, Ian D. [2 ]
Sullivan, Patricia L. Zadnik [1 ]
Shin, John H. [3 ]
Fridley, Jared S. [1 ]
Asaad, Wael F. [1,3,4,5]
Cielo, Deus [1 ]
Oyelese, Adetokunbo A. [1 ]
Doberstein, Curtis E. [1 ]
Gokaslan, Ziya L. [1 ]
Telfeian, Albert E. [1 ]
Affiliations
[1] Brown Univ, Warren Alpert Med Sch, Dept Neurosurg, Providence, RI USA
[2] Massachusetts Gen Hosp, Dept Neurosurg, Boston, MA USA
[3] Rhode Isl Hosp, Norman Prince Neurosci Inst, Dept Neurosci, Providence, RI 02903 USA
[4] Brown Univ, Dept Neurosci, Providence, RI USA
[5] Brown Univ, Carney Inst Brain Sci, Dept Neurosci, Providence, RI USA
[6] Rhode Isl Hosp, Dept Neurosurg, LPG Neurosurg, 593 Eddy St, APC 6, Providence, RI 02903 USA
Keywords
Neurosurgery; Medical education; Surgical education; Residency education; Artificial intelligence; Large language models; ChatGPT; GPT-4
DOI
10.1227/neu.0000000000002632
Chinese Library Classification
R74 [Neurology and Psychiatry]
Abstract
BACKGROUND AND OBJECTIVES: Interest in generative large language models (LLMs) has grown rapidly. Although ChatGPT (GPT-3.5), a general-purpose LLM, has shown near-passing performance on medical student board examinations, the performance of ChatGPT and its successor GPT-4 on specialized examinations, and the factors affecting their accuracy, remain unclear. This study assessed the performance of ChatGPT and GPT-4 on a 500-question mock neurosurgical written board examination.

METHODS: The Self-Assessment Neurosurgery Examinations (SANS) American Board of Neurological Surgery Self-Assessment Examination 1 was used to evaluate ChatGPT and GPT-4. Questions were in single-best-answer, multiple-choice format. χ², Fisher exact, and univariable logistic regression tests were used to assess performance differences in relation to question characteristics.

RESULTS: ChatGPT (GPT-3.5) and GPT-4 achieved scores of 73.4% (95% CI: 69.3%-77.2%) and 83.4% (95% CI: 79.8%-86.5%), respectively, relative to the user average of 72.8% (95% CI: 68.6%-76.6%). Both LLMs exceeded the prior year's passing threshold of 69%. Although the scores of ChatGPT and question bank users were equivalent (P = .963), GPT-4 outperformed both (both P < .001). GPT-4 correctly answered every question that ChatGPT answered correctly, as well as 37.6% (50/133) of the questions ChatGPT missed. Among 12 question categories, GPT-4 significantly outperformed users in each, performed comparably with ChatGPT in 3 (functional, other general, and spine), and outperformed both users and ChatGPT on tumor questions. Increased word count (odds ratio = 0.89 of answering a question correctly per +10 words) and higher-order problem-solving (odds ratio = 0.40, P = .009) were associated with lower accuracy for ChatGPT, but not for GPT-4 (both P > .005). Multimodal input was not available at the time of this study; on questions with image content, ChatGPT and GPT-4 answered 49.5% and 56.8% of questions correctly from contextual clues alone.

CONCLUSION: LLMs achieved passing scores on a mock 500-question neurosurgical written board examination, with GPT-4 significantly outperforming ChatGPT.
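As a concrete illustration of the statistics named in the abstract, the minimal Python sketch below shows how comparisons of this kind are typically computed: a χ² test on correct/incorrect counts (417 and 367 are back-calculated from the reported 83.4% and 73.4% on 500 questions) and a univariable logistic regression whose exponentiated coefficient gives an odds ratio per +10 words. The question-level data are synthetic and every variable name is hypothetical; this is not the authors' analysis code.

```python
# Hypothetical sketch of the tests described in the abstract; synthetic data.
import numpy as np
from scipy import stats
import statsmodels.api as sm

# 1) Chi-squared test on correct/incorrect counts for GPT-4 vs. ChatGPT,
#    back-calculated from the reported scores on 500 questions.
n = 500
table = np.array([[417, n - 417],   # GPT-4: correct, incorrect (83.4%)
                  [367, n - 367]])  # ChatGPT: correct, incorrect (73.4%)
chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"chi-squared = {chi2:.2f}, p = {p:.4g}")

# 2) Univariable logistic regression of correctness on question word count,
#    scaled by 10 so exp(coefficient) reads as an odds ratio per +10 words.
rng = np.random.default_rng(0)
word_count = rng.integers(20, 200, size=n)        # synthetic question lengths
logit = 1.5 - 0.012 * word_count                  # assumed mild negative effect
correct = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X = sm.add_constant(word_count / 10.0)
fit = sm.Logit(correct, X).fit(disp=0)
print(f"OR per +10 words: {np.exp(fit.params[1]):.2f}")
```

Under this convention, the reported odds ratio of 0.89 per +10 words implies that a question 100 words longer has roughly 0.89^10 ≈ 0.31 times the odds of being answered correctly by ChatGPT.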
Pages: 1353-1365
Page count: 13
Related papers (14 total)
[1] Ali R. Neurosurgery. DOI: 10.1227/neu.0000000000002618
[2] Burk-Rafel J, Santen SA, Purkiss J. Study Behaviors and USMLE Step 1 Performance: Implications of a Student Self-Directed Parallel Curriculum. Academic Medicine, 2017, 92(11): S67-S74
[3] Chen PHC, Liu Y, Peng L. How to develop machine learning models for healthcare. Nature Materials, 2019, 18(5): 410-414
[4] Gupta A. A Responsible Path to Generative AI in Healthcare. 2023
[5] Kung TH, Cheatham M, Medenilla A, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digital Health, 2023, 2(2)
[6] Liu Y, Jain A, Eng C, et al. A deep learning system for differential diagnosis of skin diseases. Nature Medicine, 2020, 26(6): 900+
[7] Martinez E. SSRN Electronic Journal, 2023: 410
[8] Moran S. How to Prepare for the USMLE Step 1. 2020
[9] Nori H. arXiv, 2023
[10] Nori H. Capabilities of GPT-4 on medical challenge problems. 2023