Repeatability, reproducibility, and diagnostic accuracy of a commercial large language model (ChatGPT) to perform emergency department triage using the Canadian triage and acuity scale

Cited by: 21
Authors
Franc, Jeffrey Michael [1, 2, 3]
Cheng, Lenard [4, 5]
Hart, Alexander [4, 6, 7]
Hata, Ryan [4, 5]
Hertelendy, Atilla [4, 8]
Affiliations
[1] Univ Alberta, Dept Emergency Med, Edmonton, AB, Canada
[2] Univ Alberta, Fac Med, Edmonton, AB, Canada
[3] Univ Piemonte Orientale, Novara, Italy
[4] Beth Israel Deaconess Med Ctr, Dept Emergency Med, Boston, MA USA
[5] Harvard Med Sch, Boston, MA USA
[6] Hartford Hosp, Dept Emergency Med, Hartford, CT 06115 USA
[7] Univ Connecticut, Sch Med, Farmington, CT USA
[8] Florida Int Univ, Coll Business, Dept Informat Syst & Business Analyt, Miami, FL USA
Keywords
Emergency medicine; Triage; Artificial intelligence; Large language models; Canadian triage and acuity scale; Medecine d'urgence; Intelligence artificielle; Grands modeles linguistiques; echelle canadienne de triage et d'acuite;
DOI
10.1007/s43678-023-00616-w
Chinese Library Classification number
R4 [Clinical Medicine]
Discipline classification code
1002; 100602
Abstract
Purpose: The release of the ChatGPT prototype to the public in November 2022 drastically reduced the barrier to using artificial intelligence by allowing easy access to a large language model with only a simple web interface. One situation where ChatGPT could be useful is in triaging patients arriving at the emergency department. This study aimed to address the research problem: "Can emergency physicians use ChatGPT to accurately triage patients using the Canadian Triage and Acuity Scale (CTAS)?"
Methods: Six unique prompts were developed independently by five emergency physicians. An automated script was used to query ChatGPT with each of the 6 prompts combined with 61 validated and previously published patient vignettes. Thirty repetitions of each combination were performed, for a total of 10,980 simulated triages.
Results: A CTAS score was returned in 99.6% of the 10,980 queries. However, there was considerable variation in the results. Repeatability (use of the same prompt repeatedly) was responsible for 21.0% of overall variation. Reproducibility (use of different prompts) was responsible for 4.0% of overall variation. The overall accuracy of ChatGPT in triaging simulated patients was 47.5%, with a 13.7% under-triage rate and a 38.7% over-triage rate. More extensively detailed prompt text was associated with greater reproducibility, but with only a minimal increase in accuracy.
Conclusions: This study suggests that the current ChatGPT large language model is not sufficient for emergency physicians to triage simulated patients using the Canadian Triage and Acuity Scale, due to poor repeatability and accuracy. Medical practitioners should be aware that while ChatGPT can be a valuable tool, it may lack consistency and may frequently provide false information.
Objectif : La sortie du prototype ChatGPT au public en novembre 2022 a considerablement reduit l'obstacle a l'utilisation de l'intelligence artificielle en permettant un acces facile a un grand modele de langage avec une interface web simple. Une situation ou ChatGPT pourrait etre utile est de trier les patients qui arrivent au service d'urgence. Cette etude visait a resoudre le probleme de la recherche : << Les medecins d'urgence peuvent-ils utiliser ChatGPT pour trier avec precision les patients a l'aide de l'echelle canadienne de triage et d'acuite (ECTC) ? >>
Methodes : Six invites uniques ont ete elaborees independamment par cinq urgentologues. Un script automatise a ete utilise pour interroger ChatGPT avec chacune des six invites combinees a 61 vignettes de patients validees et precedemment publiees. Trente repetitions de chaque combinaison ont ete realisees pour un total de 10980 triages simules.
Resultats : Dans 99.6 % des 10980 requetes, un score CTAS a ete obtenu. Cependant, il y a eu des variations considerables dans les resultats. La repetabilite (utilisation repetee de la meme invite) etait responsable de 21.0 % de la variation globale. La reproductibilite (utilisation de differentes invites) etait responsable de 4.0 % de la variation globale. La precision globale de ChatGPT pour le triage des patients simules etait de 47.5 %, avec un taux de sous-triage de 13.7 % et un taux de triage superieur de 38.7 %. Un texte plus detaille donne a titre d'invite etait associe a une plus grande reproductibilite, mais a une augmentation minimale de la precision.
Conclusions : Cette etude suggere que le modele actuel de ChatGPT en langage large n'est pas suffisant pour permettre aux medecins d'urgence de trier des patients simules a l'aide de l'echelle canadienne de triage et d'acuite en raison de la faible repetabilite et de la faible precision. Les medecins doivent etre conscients que, bien que ChatGPT puisse etre un outil precieux, il peut manquer de coherence et fournir frequemment de fausses informations.
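The study design and outcome categories described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' script: the function names are hypothetical, and the choice to compute rates over scored queries only is an assumption, since the abstract does not state the denominator. It relies on the CTAS convention that level 1 is the most urgent and level 5 the least, so an assigned level numerically lower than the reference counts as over-triage.

```python
from collections import Counter
from itertools import product

# Design figures quoted in the abstract.
N_PROMPTS = 6
N_VIGNETTES = 61
N_REPETITIONS = 30

SCORED = ("correct", "over_triage", "under_triage")


def total_queries():
    """Count every (prompt, vignette, repetition) combination in the grid."""
    grid = product(range(N_PROMPTS), range(N_VIGNETTES), range(N_REPETITIONS))
    return sum(1 for _ in grid)


def classify(assigned, reference):
    """Compare an assigned CTAS level with the reference level.

    CTAS runs from 1 (most urgent) to 5 (least urgent), so a numerically
    lower assigned level means the patient was over-triaged.
    `assigned` is None when no score was returned (hypothetical convention).
    """
    if assigned is None:
        return "no_score"
    if assigned == reference:
        return "correct"
    return "over_triage" if assigned < reference else "under_triage"


def summarize(pairs):
    """Tally outcomes for (assigned, reference) pairs.

    Assumption: rates are computed over scored queries only.
    """
    counts = Counter(classify(a, r) for a, r in pairs)
    scored = sum(counts[k] for k in SCORED)
    rates = {k: counts[k] / scored for k in SCORED} if scored else {}
    return counts, rates
```

Calling `total_queries()` reproduces the 10,980 simulated triages (6 x 61 x 30), and `summarize` can be fed any list of (assigned, reference) pairs to obtain accuracy, over-triage, and under-triage rates under the stated assumption.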
Pages: 40-46
Number of pages: 7
Related papers
17 in total
[1]   Artificial Hallucinations in ChatGPT: Implications in Scientific Writing [J].
Alkaissi, Hussam ;
McFarlane, Samy I. .
CUREUS JOURNAL OF MEDICAL SCIENCE, 2023, 15 (02)
[2]  
asq.org, GR R GAG REP REPR
[3]  
Balas M, 2023, JFO Open Ophthalmology, V1, DOI DOI 10.1016/J.JFOP.2023.100005
[4]   Potentials and pitfalls of ChatGPT and natural-language artificial intelligence models for the understanding of laboratory medicine test results. An assessment by the European Federation of Clinical Chemistry and Laboratory Medicine (EFLM) Working Group on Artificial Intelligence (WG-AI) [J].
Cadamuro, Janne ;
Cabitza, Federico ;
Debeljak, Zeljko ;
De Bruyne, Sander ;
Frans, Glynis ;
Perez, Salomon Martin ;
Ozdemir, Habib ;
Tolios, Alexander ;
Carobene, Anna ;
Padoan, Andrea .
CLINICAL CHEMISTRY AND LABORATORY MEDICINE, 2023, 61 (07) :1158-1166
[5]  
Canadian Association of Emergency Medicine, 2013, CAN TRIAG AC SCAL CO
[6]  
Choudhary N, 2017, Qual Prog, V50, P42
[7]  
ctas-phctas.ca, CAN TRIAG AC SCAL
[8]   A pilot study examining the speed and accuracy of triage for simulated disaster patients in an emergency department setting: Comparison of a computerized version of Canadian Triage Acuity Scale (CTAS) and Simple Triage and Rapid Treatment (START) methods [J].
Curran-Sills, Gwynn ;
Franc, Jeffrey M. .
CANADIAN JOURNAL OF EMERGENCY MEDICINE, 2017, 19 (05) :364-371
[9]   ChatGPT in medicine: an overview of its applications, advantages, limitations, future prospects, and ethical considerations [J].
Dave, Tirth ;
Athaluri, Sai Anirudh ;
Singh, Satyam .
FRONTIERS IN ARTIFICIAL INTELLIGENCE, 2023, 6
[10]  
del Rio J.F., 2009, Nature, V585, P357