Systematic review: The use of large language models as medical chatbots in digestive diseases

Cited by: 11
Authors
Giuffre, Mauro [1, 2, 7]
Kresevic, Simone [3]
You, Kisung [4]
Dupont, Johannes [1]
Huebner, Jack [5]
Grimshaw, Alyssa Ann [6]
Shung, Dennis Legen [1, 7]
Affiliations
[1] Yale Sch Med, Dept Internal Med Digest Dis, New Haven, CT USA
[2] Univ Trieste, Dept Med Surg & Hlth Sci, Trieste, Italy
[3] Univ Trieste, Dept Engn & Architecture, Trieste, Italy
[4] CUNY, Dept Math, Baruch Coll, New York, NY USA
[5] Yale Sch Med, Dept Internal Med, New Haven, CT USA
[6] Yale Univ, Harvey Cushing John Hay Whitney Med Lib, Res & Educ Librarian Clin, New Haven, CT USA
[7] Dept Internal Med Digest Dis, POB 208019, New Haven, CT 06520 USA
Keywords
CHATGPT;
DOI
10.1111/apt.18058
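Chinese Library Classification (CLC) number
R57 [Diseases of the digestive system and abdomen];
Discipline classification code
Abstract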
Background: Interest in large language models (LLMs), such as OpenAI's ChatGPT, across multiple specialties has grown as a source of patient-facing medical advice and provider-facing clinical decision support. The accuracy of LLM responses for gastroenterology and hepatology-related questions is unknown.

Aims: To evaluate the accuracy and potential safety implications for LLMs for the diagnosis, management and treatment of questions related to gastroenterology and hepatology.

Methods: We conducted a systematic literature search including Cochrane Library, Google Scholar, Ovid Embase, Ovid MEDLINE, PubMed, Scopus and the Web of Science Core Collection to identify relevant articles published from inception until January 28, 2024, using a combination of keywords and controlled vocabulary for LLMs and gastroenterology or hepatology. Accuracy was defined as the percentage of entirely correct answers.

Results: Among the 1671 reports screened, we identified 33 full-text articles on using LLMs in gastroenterology and hepatology and included 18 in the final analysis. The accuracy of question-responding varied across different model versions. For example, accuracy ranged from 6.4% to 45.5% with ChatGPT-3.5 and was between 40% and 91.4% with ChatGPT-4. In addition, the absence of standardised methodology and reporting metrics for studies involving LLMs places all the studies at a high risk of bias and does not allow for the generalisation of single-study results.

Conclusions: Current general-purpose LLMs have unacceptably low accuracy on clinical gastroenterology and hepatology tasks, which may lead to adverse patient safety events through incorrect information or triage recommendations, which might overburden healthcare systems or delay necessary care. Available large language models are not accurate enough to be deployed in real-life clinical practice, despite their user-friendly interfaces and rapid improvement cycles. The absence of standardised methods and benchmarks will probably delay their safe deployment into real-life clinical settings.
Pages: 144-166
Page count: 23
Related papers
62 items in total
  • [1] [Anonymous], 2021, YALE U HARVEY CUSHIN
  • [2] [Anonymous], Embeddings
  • [3] Applicability of Online Chat-Based Artificial Intelligence Models to Colorectal Cancer Screening
    Atarere, Joseph
    Naqvi, Haider
    Haas, Christopher
    Adewunmi, Comfort
    Bandaru, Sumanth
    Allamneni, Rakesh
    Ugonabo, Onyinye
    Egbo, Olachi
    Umoren, Mfoniso
    Kanth, Priyanka
    [J]. DIGESTIVE DISEASES AND SCIENCES, 2024, 69 (03) : 791 - 797
  • [4] Baily M, Machines of mind: the case for an AI-powered productivity boom
  • [5] Benchmarking medical large language models
    Bakhshandeh, Sadra
    [J]. NATURE REVIEWS BIOENGINEERING, 2023, 1 (08) : 543 - 543
  • [6] Cai TT., 2021, THEORETICAL FDN TSNE
  • [7] Synthesis without meta-analysis (SWiM) in systematic reviews: reporting guideline
    Campbell, Mhairi
    McKenzie, Joanne E.
    Sowden, Amanda
    Katikireddi, Srinivasa Vittal
    Brennan, Sue E.
    Ellis, Simon
    Hartmann-Boyce, Jamie
    Ryan, Rebecca
    Shepperd, Sasha
    Thomas, James
    Welch, Vivian
    Thomson, Hilary
    [J]. BMJ-BRITISH MEDICAL JOURNAL, 2020, 368
  • [8] Reliability and Usefulness of ChatGPT for Inflammatory Bowel Diseases: An Analysis for Patients and Healthcare Professionals
    Cankurtaran, Rasim Eren
    Polat, Yunus Halil
    Aydemir, Neslihan Gunes
    Umay, Ebru
    Yurekli, Oyku Tayfur
    [J]. CUREUS JOURNAL OF MEDICAL SCIENCE, 2023, 15 (10)
  • [9] Accuracy of Information Provided by ChatGPT Regarding Liver Cancer Surveillance and Diagnosis
    Cao, Jennie J.
    Kwon, Daniel H.
    Ghaziani, Tara T.
    Kwo, Paul
    Tse, Gary
    Kesselman, Andrew
    Kamaya, Aya
    Tse, Justin R.
    [J]. AMERICAN JOURNAL OF ROENTGENOLOGY, 2023, 221 (04) : 556 - 559
  • [10] Chen B., 2023, Unleashing the potential of prompt engineering in large language models: a comprehensive review