Triage Performance Across Large Language Models, ChatGPT, and Untrained Doctors in Emergency Medicine: Comparative Study

Cited by: 27

Authors:

Masanneck, Lars [1,2,3]

Schmidt, Linea [3]

Seifert, Antonia [2,4]

Koelsche, Tristan [1,2]

Huntemann, Niklas [1,2]

Jansen, Robin [1,2]

Mehsin, Mohammed [1,2]

Bernhard, Michael [4]

Meuth, Sven G. [1,2]

Boehm, Lennert [2,4]

Pawlitzki, Marc [1,2]

Affiliations:

[1] Heinrich Heine Univ Dusseldorf, Med Fac, Dept Neurol, Moorenstr 5, D-40225 Dusseldorf, Germany

[2] Heinrich Heine Univ Dusseldorf, Univ Hosp Dusseldorf, Dusseldorf, Germany

[3] Univ Potsdam, Hasso Plattner Inst, Digital Hlth Ctr, Potsdam, Germany

[4] Heinrich Heine Univ Dusseldorf, Med Fac, Emergency Dept, Dusseldorf, Germany

Source:

JOURNAL OF MEDICAL INTERNET RESEARCH | 2024 / Volume 26

Keywords:

emergency medicine; triage; artificial intelligence; large language models; ChatGPT; untrained doctors; doctor; doctors; comparative study; digital health; personnel; staff; cohort; Germany; German; AGREEMENT;

DOI:

10.2196/53297

Chinese Library Classification (CLC) number:

R19 [Health care organizations and services (health services administration)];

Discipline classification code:

Abstract:

Background: Large language models (LLMs) have demonstrated impressive performance in various medical domains, prompting an exploration of their potential utility within the high-demand setting of emergency department (ED) triage. This study evaluated the triage proficiency of different LLMs and ChatGPT, an LLM-based chatbot, compared to professionally trained ED staff and untrained personnel. We further explored whether LLM responses could guide untrained staff in effective triage.

Objective: This study aimed to assess the efficacy of LLMs and the associated product ChatGPT in ED triage compared to personnel of varying training status and to investigate whether the models' responses can enhance the triage proficiency of untrained personnel.

Methods: A total of 124 anonymized case vignettes were triaged by untrained doctors; different versions of currently available LLMs; ChatGPT; and professionally trained raters, who subsequently agreed on a consensus set according to the Manchester Triage System (MTS). The prototypical vignettes were adapted from cases at a tertiary ED in Germany. The main outcome was the level of agreement between raters' MTS level assignments, measured via quadratic-weighted Cohen kappa. The extent of over- and undertriage was also determined. Notably, instances of ChatGPT were prompted using zero-shot approaches without extensive background information on the MTS. The tested LLMs included raw GPT-4, Llama 3 70B, Gemini 1.5, and Mixtral 8x7b.

Results: GPT-4-based ChatGPT and untrained doctors showed substantial agreement with the consensus triage of professional raters (kappa=mean 0.67, SD 0.037 and kappa=mean 0.68, SD 0.056, respectively), significantly exceeding the performance of GPT-3.5-based ChatGPT (kappa=mean 0.54, SD 0.024; P<.001). When untrained doctors used this LLM for second-opinion triage, there was a slight but statistically nonsignificant performance increase (kappa=mean 0.70, SD 0.047; P=.97). Other tested LLMs performed similarly to or worse than GPT-4-based ChatGPT or showed odd triaging behavior with the parameters used. LLMs and ChatGPT models tended toward overtriage, whereas untrained doctors undertriaged.

Conclusions: While LLMs and the LLM-based product ChatGPT do not yet match professionally trained raters, their best models' triage proficiency equals that of untrained ED doctors. In their current form, LLMs and ChatGPT thus did not demonstrate gold-standard performance in ED triage and, in the setting of this study, failed to significantly improve untrained doctors' triage when used as decision support. Notable performance enhancements in newer LLM versions over older ones hint at future improvements with further technological development and specific training.
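
The two outcome measures named in the Methods, quadratic-weighted Cohen kappa and the extent of over- and undertriage, can be illustrated with a minimal Python sketch. It is not the authors' analysis code. It assumes MTS levels are coded as integers 1 to 5 with 1 the most urgent, so an assigned level numerically lower than the consensus counts as overtriage; the function names and the sample ratings are hypothetical.

```python
def quadratic_weighted_kappa(rater_a, rater_b, levels=(1, 2, 3, 4, 5)):
    """Cohen kappa with quadratic disagreement weights for ordinal ratings."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    k = len(levels)
    index = {level: i for i, level in enumerate(levels)}

    # Observed joint proportions of (rater_a level, rater_b level).
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1 / n

    # Marginal proportions per rater, used for chance-expected agreement.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            # Quadratic weight: 0 on the diagonal, 1 for the largest gap.
            w = ((i - j) / (k - 1)) ** 2
            weighted_observed += w * observed[i][j]
            weighted_expected += w * row[i] * col[j]

    if weighted_expected == 0:
        # Both raters used one identical level throughout; kappa is undefined.
        return float("nan")
    return 1 - weighted_observed / weighted_expected


def triage_direction_rates(assigned, consensus):
    """Return (overtriage rate, undertriage rate) against the consensus.

    Assumption: a lower level number means higher urgency.
    """
    n = len(consensus)
    over = sum(a < c for a, c in zip(assigned, consensus)) / n
    under = sum(a > c for a, c in zip(assigned, consensus)) / n
    return over, under


# Hypothetical ratings for six vignettes, for illustration only.
consensus = [1, 2, 3, 4, 4, 3]
assigned = [1, 2, 2, 4, 5, 3]
kappa = quadratic_weighted_kappa(assigned, consensus)
over_rate, under_rate = triage_direction_rates(assigned, consensus)
```

Because the weights grow with the square of the distance between levels, a rating two levels away from the consensus is penalized four times as heavily as a rating one level away, which suits an ordinal scale such as triage urgency.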


Pages: 10
