Performance of Large Language Models on a Neurology Board-Style Examination

Times Cited: 24
Authors
Schubert, Marc Cicero [1 ,2 ]
Wick, Wolfgang [1 ,2 ]
Venkataramani, Varun [1 ,2 ]
Affiliations
[1] Univ Hosp Heidelberg, Neurol Clin, Neuenheimer Feld 400, D-69120 Heidelberg, Germany
[2] Univ Hosp Heidelberg, Natl Ctr Tumor Dis, Neuenheimer Feld 400, D-69120 Heidelberg, Germany
DOI
10.1001/jamanetworkopen.2023.46721
CLC Number
R5 [Internal Medicine]
Subject Classification Codes
1002; 100201
Abstract
Importance: Recent advancements in large language models (LLMs) have shown potential in a wide array of applications, including health care. While LLMs showed heterogeneous results across specialized medical board examinations, the performance of these models in neurology board examinations remains unexplored.
Objective: To assess the performance of LLMs on neurology board-style examinations.
Design, Setting, and Participants: This cross-sectional study was conducted between May 17 and May 31, 2023. The evaluation utilized a question bank approved by the American Board of Psychiatry and Neurology and was validated with a small question cohort by the European Board for Neurology. All questions were categorized into lower-order (recall, understanding) and higher-order (apply, analyze, synthesize) questions based on the Bloom taxonomy for learning and assessment. Performance by LLM ChatGPT versions 3.5 (LLM 1) and 4 (LLM 2) was assessed in relation to overall scores, question type, and topics, along with the confidence level and reproducibility of answers.
Main Outcomes and Measures: Overall percentage scores of 2 LLMs.
Results: LLM 2 significantly outperformed LLM 1 by correctly answering 1662 of 1956 questions (85.0%) vs 1306 questions (66.8%) for LLM 1. Notably, LLM 2's performance was greater than the mean human score of 73.8%, effectively achieving near-passing and passing grades in the neurology board examination. LLM 2 outperformed human users in behavioral, cognitive, and psychological-related questions and demonstrated superior performance to LLM 1 in 6 categories. Both LLMs performed better on lower-order than higher-order questions, with LLM 2 excelling in both lower-order and higher-order questions. Both models consistently used confident language, even when providing incorrect answers. Reproducible answers of both LLMs were associated with a higher percentage of correct answers than inconsistent answers.
Conclusions and Relevance: Despite the absence of neurology-specific training, LLM 2 demonstrated commendable performance, whereas LLM 1 performed slightly below the human average. While higher-order cognitive tasks were more challenging for both models, LLM 2's results were equivalent to passing grades in specialized neurology examinations. These findings suggest that LLMs could have significant applications in clinical neurology and health care with further refinements.
Pages: 11
Related Articles
50 records in total
  • [1] Performance of large language models on a neurology board-style examination (vol 6, e2346721, 2023)
    Schubert, M. C.
    Wick, W.
    JAMA NETWORK OPEN, 2024, 7 (01)
  • [2] Performance of Generative Large Language Models on Ophthalmology Board-Style Questions
    Cai, Louis Z.
    Shaheen, Abdulla
    Jin, Andrew
    Fukui, Riya
    Yi, Jonathan S.
    Yannuzzi, Nicolas
    Alabiad, Chrisfouad
    AMERICAN JOURNAL OF OPHTHALMOLOGY, 2023, 254: 141-149
  • [3] Artificial Intelligence for Anesthesiology Board-Style Examination Questions: Role of Large Language Models
    Khan, Adnan A.
    Yunus, Rayaan
    Sohail, Mahad
    Rehman, Taha A.
    Saeed, Shirin
    Bu, Yifan
    Jackson, Cullen D.
    Sharkey, Aidan
    Mahmood, Feroze
    Matyal, Robina
    JOURNAL OF CARDIOTHORACIC AND VASCULAR ANESTHESIA, 2024, 38 (05): 1251-1259
  • [4] Performance of Publicly Available Large Language Models on Internal Medicine Board-style Questions
    Tarabanis, Constantine
    Zahid, Sohail
    Mamalis, Marios
    Zhang, Kevin
    Kalampokis, Evangelos
    Jankelson, Lior
    PLOS DIGITAL HEALTH, 2024, 3 (09)
  • [5] Llama 3 Challenges Proprietary State-of-the-Art Large Language Models in Radiology Board-style Examination Questions
    Adams, Lisa C.
    Truhn, Daniel
    Busch, Felix
    Dorfner, Felix
    Nawabi, Jawed
    Makowski, Marcus R.
    Bressem, Keno K.
    RADIOLOGY, 2024, 312 (02)
  • [6] Large Language Models as Tools to Generate Radiology Board-Style Multiple-Choice Questions
    Mistry, Neel P.
    Saeed, Huzaifa
    Rafique, Sidra
    Le, Thuy
    Obaid, Haron
    Adams, Scott J.
    ACADEMIC RADIOLOGY, 2024, 31 (09): 3872-3878
  • [7] Accuracy of large language models in answering ophthalmology board-style questions: A meta-analysis
    Wu, Jo-Hsuan
    Nishida, Takashi
    Liu, T. Y. Alvin
    ASIA-PACIFIC JOURNAL OF OPHTHALMOLOGY, 2024, 13 (05)
  • [8] Performance of ChatGPT on a Radiology Board-style Examination: Insights into Current Strengths and Limitations
    Bhayana, Rajesh
    Krishna, Satheesh
    Bleakney, Robert R.
    RADIOLOGY, 2023, 307 (05)
  • [9] The performance of artificial intelligence language models in board-style dental knowledge assessment: A preliminary study on ChatGPT
    Danesh, Arman
    Pazouki, Hirad
    Danesh, Kasra
    Danesh, Farzad
    Danesh, Arsalan
    JOURNAL OF THE AMERICAN DENTAL ASSOCIATION, 2023, 154 (11): 970-974
  • [10] Artificial Intelligence Showdown in Gastroenterology: A Comparative Analysis of Large Language Models (LLMs) in Tackling Board-Style Review Questions
    Shah, Kevin P.
    Dey, Shirin A.
    Pothula, Shravya
    Abud, Arnold
    Jain, Sukrit
    Srivastava, Aniruddha
    Dommaraju, Sagar
    Komanduri, Srinadh
    AMERICAN JOURNAL OF GASTROENTEROLOGY, 2024, 119 (10S): S1567-S1568