Performance of Large Language Models on a Neurology Board-Style Examination

Times Cited: 24
Authors
Schubert, Marc Cicero [1 ,2 ]
Wick, Wolfgang [1 ,2 ]
Venkataramani, Varun [1 ,2 ]
Affiliations
[1] Univ Hosp Heidelberg, Neurol Clin, Neuenheimer Feld 400, D-69120 Heidelberg, Germany
[2] Univ Hosp Heidelberg, Natl Ctr Tumor Dis, Neuenheimer Feld 400, D-69120 Heidelberg, Germany
DOI
10.1001/jamanetworkopen.2023.46721
CLC Number
R5 [Internal Medicine]
Subject Classification Codes
1002; 100201
Abstract
Importance: Recent advancements in large language models (LLMs) have shown potential in a wide array of applications, including health care. While LLMs showed heterogeneous results across specialized medical board examinations, the performance of these models in neurology board examinations remains unexplored.
Objective: To assess the performance of LLMs on neurology board-style examinations.
Design, Setting, and Participants: This cross-sectional study was conducted between May 17 and May 31, 2023. The evaluation utilized a question bank approved by the American Board of Psychiatry and Neurology and was validated with a small question cohort by the European Board for Neurology. All questions were categorized into lower-order (recall, understanding) and higher-order (apply, analyze, synthesize) questions based on the Bloom taxonomy for learning and assessment. Performance by LLM ChatGPT versions 3.5 (LLM 1) and 4 (LLM 2) was assessed in relation to overall scores, question type, and topics, along with the confidence level and reproducibility of answers.
Main Outcomes and Measures: Overall percentage scores of 2 LLMs.
Results: LLM 2 significantly outperformed LLM 1 by correctly answering 1662 of 1956 questions (85.0%) vs 1306 questions (66.8%) for LLM 1. Notably, LLM 2's performance was greater than the mean human score of 73.8%, effectively achieving near-passing and passing grades in the neurology board examination. LLM 2 outperformed human users in behavioral, cognitive, and psychological-related questions and demonstrated superior performance to LLM 1 in 6 categories. Both LLMs performed better on lower-order than higher-order questions, with LLM 2 excelling in both lower-order and higher-order questions. Both models consistently used confident language, even when providing incorrect answers. Reproducible answers of both LLMs were associated with a higher percentage of correct answers than inconsistent answers.
Conclusions and Relevance: Despite the absence of neurology-specific training, LLM 2 demonstrated commendable performance, whereas LLM 1 performed slightly below the human average. While higher-order cognitive tasks were more challenging for both models, LLM 2's results were equivalent to passing grades in specialized neurology examinations. These findings suggest that LLMs could have significant applications in clinical neurology and health care with further refinements.
Pages: 11
Related Articles
50 records in total
  • [1] Performance of large language models on a neurology board-style examination (vol 6, e2346721, 2023)
    Schubert, M. C.
    Wick, W.
    JAMA NETWORK OPEN, 2024, 7 (01)
  • [2] Performance of Generative Large Language Models on Ophthalmology Board-Style Questions
    Cai, Louis Z.
    Shaheen, Abdulla
    Jin, Andrew
    Fukui, Riya
    Yi, Jonathan S.
    Yannuzzi, Nicolas
    Alabiad, Chrisfouad
    AMERICAN JOURNAL OF OPHTHALMOLOGY, 2023, 254: 141-149
  • [3] Artificial Intelligence for Anesthesiology Board-Style Examination Questions: Role of Large Language Models
    Khan, Adnan A.
    Yunus, Rayaan
    Sohail, Mahad
    Rehman, Taha A.
    Saeed, Shirin
    Bu, Yifan
    Jackson, Cullen D.
    Sharkey, Aidan
    Mahmood, Feroze
    Matyal, Robina
    JOURNAL OF CARDIOTHORACIC AND VASCULAR ANESTHESIA, 2024, 38 (05): 1251-1259
  • [4] Performance of Publicly Available Large Language Models on Internal Medicine Board-style Questions
    Tarabanis, Constantine
    Zahid, Sohail
    Mamalis, Marios
    Zhang, Kevin
    Kalampokis, Evangelos
    Jankelson, Lior
    PLOS DIGITAL HEALTH, 2024, 3 (09)
  • [5] Llama 3 Challenges Proprietary State-of-the-Art Large Language Models in Radiology Board-style Examination Questions
    Adams, Lisa C.
    Truhn, Daniel
    Busch, Felix
    Dorfner, Felix
    Nawabi, Jawed
    Makowski, Marcus R.
    Bressem, Keno K.
    RADIOLOGY, 2024, 312 (02)
  • [6] Large Language Models as Tools to Generate Radiology Board-Style Multiple-Choice Questions
    Mistry, Neel P.
    Saeed, Huzaifa
    Rafique, Sidra
    Le, Thuy
    Obaid, Haron
    Adams, Scott J.
    ACADEMIC RADIOLOGY, 2024, 31 (09): 3872-3878
  • [7] Accuracy of large language models in answering ophthalmology board-style questions: A meta-analysis
    Wu, Jo-Hsuan
    Nishida, Takashi
    Liu, T. Y. Alvin
    ASIA-PACIFIC JOURNAL OF OPHTHALMOLOGY, 2024, 13 (05)
  • [8] Performance of ChatGPT on a Radiology Board-style Examination: Insights into Current Strengths and Limitations
    Bhayana, Rajesh
    Krishna, Satheesh
    Bleakney, Robert R.
    RADIOLOGY, 2023, 307 (05)
  • [9] The performance of artificial intelligence language models in board-style dental knowledge assessment: A preliminary study on ChatGPT
    Danesh, Arman
    Pazouki, Hirad
    Danesh, Kasra
    Danesh, Farzad
    Danesh, Arsalan
    JOURNAL OF THE AMERICAN DENTAL ASSOCIATION, 2023, 154 (11): 970-974
  • [10] Artificial Intelligence Showdown in Gastroenterology: A Comparative Analysis of Large Language Models (LLMs) in Tackling Board-Style Review Questions
    Shah, Kevin P.
    Dey, Shirin A.
    Pothula, Shravya
    Abud, Arnold
    Jain, Sukrit
    Srivastava, Aniruddha
    Dommaraju, Sagar
    Komanduri, Srinadh
    AMERICAN JOURNAL OF GASTROENTEROLOGY, 2024, 119 (10S): S1567-S1568