Comparative evaluation and performance of large language models on expert level critical care questions: a benchmark study

Cited: 0
Authors
Workum, Jessica D. [1 ,2 ,3 ]
Volkers, Bas W. S. [1 ,3 ]
van de Sande, Davy [1 ,3 ]
Arora, Sumesh [4 ]
Goeijenbier, Marco [1 ,5 ]
Gommers, Diederik [1 ,3 ]
van Genderen, Michel E. [1 ,3 ]
Affiliations
[1] Erasmus MC Univ Med Ctr, Dept Adult Intens Care, Rotterdam, Netherlands
[2] Elisabeth Tweesteden Hosp, Dept Intens Care, Tilburg, Netherlands
[3] Erasmus MC Univ Med Ctr, Erasmus MC Datahub, Rotterdam, Netherlands
[4] Prince Wales Hosp, Sydney, Australia
[5] Spaarne Gasthuis, Dept Intens Care Med, Hoofddorp, Netherlands
Keywords
Large language models; Generative artificial intelligence; Critical care; Benchmarking
DOI
10.1186/s13054-025-05302-0
Chinese Library Classification: R4 [Clinical Medicine]
Subject Classification: 1002; 100602
Abstract
Background: Large language models (LLMs) show increasing potential for use in healthcare, both for administrative support and for clinical decision making. However, reports on their performance in critical care medicine are lacking.
Methods: This study evaluated five LLMs (GPT-4o, GPT-4o-mini, GPT-3.5-turbo, Mistral Large 2407, and Llama 3.1 70B) on 1181 multiple-choice questions (MCQs) from the gotheextramile.com database, a comprehensive database of critical care questions at the level of the European Diploma in Intensive Care examination. Their performance was compared with random guessing and with 350 human physicians on a 77-MCQ practice test. Metrics included accuracy, consistency, and domain-specific performance. Costs, as a proxy for energy consumption, were also analyzed.
Results: GPT-4o achieved the highest accuracy at 93.3%, followed by Mistral Large 2407 (87.9%), Llama 3.1 70B (87.5%), GPT-4o-mini (83.0%), and GPT-3.5-turbo (72.7%); random guessing yielded 41.5% (p < 0.001). On the practice test, these models scored 89.0%, 84.4%, 80.9%, 80.3%, and 66.5%, respectively, compared with 42.7% for random guessing (p < 0.001) and 61.9% for the human physicians. All models except GPT-3.5-turbo significantly outperformed the physicians (p < 0.001); GPT-3.5-turbo's higher score was not statistically significant (p = 0.196). Despite high overall consistency, all models gave consistently incorrect answers to some questions. GPT-4o was the most expensive model, costing over 25 times more than the least expensive model, GPT-4o-mini.
Conclusions: LLMs exhibit high accuracy and consistency, with four of the five outperforming human physicians on a European-level practice exam. GPT-4o led in performance but raised concerns about energy consumption. Despite their potential in critical care, all models produced consistently incorrect answers, highlighting the need for thorough and ongoing evaluation to guide responsible implementation in clinical settings.
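To make the evaluation setup concrete, the sketch below shows one way such an MCQ benchmark loop could be implemented. This is an illustration, not the authors' code: the MCQ structure and the ask_model wrapper are hypothetical stand-ins, the 0.25 chance rate assumes four options per question (the study's empirical random-guessing baseline was 41.5%, suggesting questions vary in option count), and a binomial test is used here as one plausible significance test, since the abstract does not name the test applied.

```python
# Minimal sketch of an MCQ benchmark loop in the spirit of the study:
# query a model repeatedly per question, compute accuracy, count
# consistently incorrect answers, and test accuracy against chance.
from dataclasses import dataclass
from scipy.stats import binomtest

@dataclass
class MCQ:
    stem: str
    options: dict[str, str]  # option label -> option text, e.g. {"A": "...", ...}
    answer: str              # correct option label, e.g. "B"

def ask_model(model: str, q: MCQ) -> str:
    """Hypothetical wrapper around a chat-completion API that returns
    the model's chosen option label for one question."""
    raise NotImplementedError

def evaluate(model: str, questions: list[MCQ], runs: int = 3) -> float:
    """Score a model on the MCQ set; repeated runs enable a consistency check."""
    correct = 0
    consistently_wrong = 0
    for q in questions:
        answers = [ask_model(model, q) for _ in range(runs)]
        correct += sum(a == q.answer for a in answers)
        # "Consistently incorrect": every repetition returns the same wrong label.
        if len(set(answers)) == 1 and answers[0] != q.answer:
            consistently_wrong += 1
    n = len(questions) * runs
    accuracy = correct / n
    # Significance vs. random guessing; p=0.25 is an assumption (four options),
    # whereas the study measured an empirical chance baseline of 41.5%.
    p_value = binomtest(correct, n, p=0.25, alternative="greater").pvalue
    print(f"{model}: accuracy={accuracy:.1%}, "
          f"consistently wrong on {consistently_wrong} questions, "
          f"p vs. chance={p_value:.3g}")
    return accuracy
```

Repeating each question and flagging wrong answers that recur on every run mirrors the consistency analysis the abstract describes, including the "consistently incorrect" failure mode reported for all five models.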
Pages: 8