Large language models' responses to liver cancer surveillance, diagnosis, and management questions: accuracy, reliability, readability

Cited by: 3
Authors
Cao, Jennie J. [1 ]
Kwon, Daniel H. [2 ]
Ghaziani, Tara T. [3 ]
Kwo, Paul [3 ]
Tse, Gary [4 ]
Kesselman, Andrew [5 ]
Kamaya, Aya [1 ]
Tse, Justin R. [1 ]
Affiliations
[1] Stanford Univ, Sch Med, Dept Radiol, 300 Pasteur Dr,Room H-1307, Stanford, CA 94305 USA
[2] Univ Calif San Francisco, Sch Med, Dept Med, 505 Parnassus Ave, MC1286C, San Francisco, CA 94144 USA
[3] Stanford Univ, Dept Med, Sch Med, 430 Broadway St MC 6341, Redwood City, CA 94063 USA
[4] Univ Calif Los Angeles, David Geffen Sch Med, Dept Radiol Sci, 757 Westwood Plaza, Los Angeles, CA 90095 USA
[5] Stanford Univ, Sch Med, Dept Urol, 875 Blake Wilbur Dr, Palo Alto, CA 94304 USA
Keywords
Liver cancer; Hepatocellular carcinoma; Artificial intelligence; Large language model; ChatGPT
DOI
10.1007/s00261-024-04501-7
Chinese Library Classification
R8 [Special Medicine]; R445 [Diagnostic Imaging]
Discipline Codes
1002; 100207; 1009
Abstract
Purpose: To assess the accuracy, reliability, and readability of publicly available large language models in answering fundamental questions on hepatocellular carcinoma diagnosis and management.
Methods: Twenty questions on liver cancer diagnosis and management were asked in triplicate to ChatGPT-3.5 (OpenAI), Gemini (Google), and Bing (Microsoft). Responses were assessed by six fellowship-trained physicians from three academic liver transplant centers who actively diagnose and/or treat liver cancer. Each response was scored as accurate (score 1; all information is true and relevant), inadequate (score 0; all information is true but does not fully answer the question or provides irrelevant information), or inaccurate (score −1; any information is false). Means with standard deviations were recorded. A response was considered accurate overall if its mean score was > 0, and a question's responses were considered reliable if the mean score was > 0 across all responses to that question. Readability was quantified using the Flesch Reading Ease Score and the Flesch-Kincaid Grade Level. Readability and accuracy across the 60 responses were compared using one-way ANOVAs with Tukey's multiple comparison tests.
Results: Of the twenty questions, ChatGPT answered nine (45%), Gemini 12 (60%), and Bing six (30%) accurately; however, only six (30%), eight (40%), and three (15%), respectively, were both accurate and reliable. There were no significant differences in accuracy between the chatbots. ChatGPT responses were the least readable (mean Flesch Reading Ease Score 29; college graduate level), followed by Gemini (30; college) and Bing (40; college; p < 0.001).
Conclusion: Large language models provide complex responses to basic questions on hepatocellular carcinoma diagnosis and management that are seldom accurate, reliable, or readable.
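The scoring and readability metrics described in the Methods can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function names are invented here, while the two readability formulas are the standard published Flesch Reading Ease and Flesch-Kincaid Grade Level equations, which take counts of words, sentences, and syllables as inputs.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch Reading Ease formula; higher scores mean easier text.
    Scores around 29-40 (as reported in this study) fall in the
    'college'/'college graduate' difficulty band."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def fk_grade_level(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch-Kincaid Grade Level formula (approximate US school grade)."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def mean_score(scores: list[int]) -> float:
    """Mean of reviewer scores, each in {-1, 0, 1} per the study's rubric."""
    return sum(scores) / len(scores)

def is_accurate(scores: list[int]) -> bool:
    """A response is treated as accurate overall if its mean score is > 0."""
    return mean_score(scores) > 0

def is_reliable(per_response_scores: list[list[int]]) -> bool:
    """A question's triplicate responses are reliable if every response's
    mean score is > 0."""
    return all(is_accurate(s) for s in per_response_scores)
```

For example, a question whose three replicate responses were scored [1, 1, 0], [1, 0, 0], and [1, 1, 1] by three reviewers would count as both accurate and reliable under this scheme, whereas a single replicate averaging at or below zero would make the question unreliable even if the other two replicates were accurate.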
Pages: 4286-4294 (9 pages)