Evaluating the Appropriateness, Consistency, and Readability of ChatGPT in Critical Care Recommendations

Cited by: 5
Authors
Balta, Kaan Y. [1 ]
Javidan, Arshia P. [2 ]
Walser, Eric [3 ,4 ]
Arntfield, Robert [3 ]
Prager, Ross [3 ]
Affiliations
[1] Western Univ, Schulich Sch Med & Dent, 1151 Richmond St, London, ON N6A5C1, Canada
[2] Univ Toronto, Dept Surg, Div Vasc Surg, Toronto, ON, Canada
[3] Western Univ, London Hlth Sci Ctr, Div Crit Care, London, ON, Canada
[4] London Hlth Sci Ctr, Dept Surg, Trauma Program, London, ON, Canada
Keywords
appropriateness; artificial intelligence; critical care; large language models
DOI
10.1177/08850666241267871
CLC Number
R4 [Clinical Medicine]
Discipline Codes
1002; 100602
Abstract
Background: We assessed 2 versions of the large language model (LLM) ChatGPT, versions 3.5 and 4.0, in generating appropriate, consistent, and readable recommendations on core critical care topics. Research Question: How do successive large language models compare in generating appropriate, consistent, and readable recommendations on core critical care topics? Design and Methods: A set of 50 LLM-generated responses to clinical questions was evaluated by 2 independent intensivists on a 5-point Likert scale for appropriateness, consistency, and readability. Results: ChatGPT 4.0 showed significantly higher median appropriateness scores than ChatGPT 3.5 (4.0 vs 3.0, P < .001). However, there was no significant difference in consistency between the 2 versions (40% vs 28%, P = .291). Readability, assessed by the Flesch-Kincaid Grade Level, was also not significantly different between the 2 models (14.3 vs 14.4, P = .93). Interpretation: Both models produced "hallucinations" (misinformation delivered with high confidence), which highlights the risk of relying on these tools without domain expertise. Despite their potential for clinical application, both models lacked consistency, producing different results when asked the same question multiple times. The study underscores the need for clinicians to understand the strengths and limitations of LLMs for safe and effective implementation in critical care settings. Registration: https://osf.io/8chj7/
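The abstract reports readability as a Flesch-Kincaid Grade Level (FKGL). The standard FKGL formula is 0.39 × (words/sentences) + 11.8 × (syllables/words) − 15.59; the sketch below uses that published formula, but the regex-based sentence splitter and vowel-group syllable counter are rough heuristics assumed for illustration, not the online calculator the authors cite.

```python
import re


def count_syllables(word: str) -> int:
    """Crude syllable estimate: count runs of consecutive vowels.

    Real FKGL tools use dictionary lookups and exception rules;
    this heuristic is only a stand-in for illustration.
    """
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def fkgl(text: str) -> float:
    """Flesch-Kincaid Grade Level:
    0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59
    """
    # Treat each run of terminal punctuation as one sentence boundary.
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / sentences + 11.8 * syllables / len(words) - 15.59
```

A grade level around 14, as reported for both models, corresponds to college-level text, well above the grade 6-8 range usually recommended for patient-facing material.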
Pages: 184-190 (7 pages)
Related Papers (23 in total)
[1]   Artificial Hallucinations in ChatGPT: Implications in Scientific Writing [J].
Alkaissi, Hussam ;
McFarlane, Samy I. .
CUREUS JOURNAL OF MEDICAL SCIENCE, 2023, 15 (02)
[2]   Large language models and the perils of their hallucinations [J].
Azamfirei, Razvan ;
Kudchadkar, Sapna R. ;
Fackler, James .
CRITICAL CARE, 2023, 27 (01)
[3]   Scite [J].
Brody, Stacy .
JOURNAL OF THE MEDICAL LIBRARY ASSOCIATION, 2021, 109 (04) :707-710
[4]   Evaluation of ChatGPT in Predicting 6-Month Outcomes After Traumatic Brain Injury [J].
Gakuba, Clement ;
Le Barbey, Charlene ;
Sar, Alexandre ;
Bonnet, Gregory ;
Cerasuolo, Damiano ;
Giabicani, Mikhael ;
Moyer, Jean-Denis .
CRITICAL CARE MEDICINE, 2024, 52 (06) :942-950
[5]   Comparing scientific abstracts generated by ChatGPT to real abstracts with detectors and blinded human reviewers [J].
Gao, Catherine A. ;
Howard, Frederick M. ;
Markov, Nikolay S. ;
Dyer, Emma C. ;
Ramesh, Siddhi ;
Luo, Yuan ;
Pearson, Alexander T. .
NPJ DIGITAL MEDICINE, 2023, 6 (01)
[6]
Good Calculators, Flesch Kincaid Calculator | Good Calculators
[7]
Google, Bard: chat-based AI tool from Google, powered by PaLM 2
[8]
Haver HL., 2023, APPROPRIATENESS BREA, DOI 10.1148/RADIOL.230424
[9]  
hmjournals, VIEW USE CHAT GPT SO
[10]   ChatGPT and antimicrobial advice: the end of the consulting infection doctor? [J].
Howard, Alex ;
Hope, William ;
Gerada, Alessandro .
LANCET INFECTIOUS DISEASES, 2023, 23 (04) :405-406