Large language models' responses to liver cancer surveillance, diagnosis, and management questions: accuracy, reliability, readability

Cited: 11
Authors
Cao, Jennie J. [1 ]
Kwon, Daniel H. [2 ]
Ghaziani, Tara T. [3 ]
Kwo, Paul [3 ]
Tse, Gary [4 ]
Kesselman, Andrew [5 ]
Kamaya, Aya [1 ]
Tse, Justin R. [1 ]
Affiliations
[1] Stanford Univ, Sch Med, Dept Radiol, 300 Pasteur Dr, Room H-1307, Stanford, CA 94305 USA
[2] Univ Calif San Francisco, Sch Med, Dept Med, 505 Parnassus Ave, MC1286C, San Francisco, CA 94144 USA
[3] Stanford Univ, Sch Med, Dept Med, 430 Broadway St, MC 6341, Redwood City, CA 94063 USA
[4] Univ Calif Los Angeles, David Geffen Sch Med, Dept Radiol Sci, 757 Westwood Plaza, Los Angeles, CA 90095 USA
[5] Stanford Univ, Sch Med, Dept Urol, 875 Blake Wilbur Dr, Palo Alto, CA 94304 USA
Keywords
Liver cancer; Hepatocellular carcinoma; Artificial intelligence; Large language model; ChatGPT
DOI
10.1007/s00261-024-04501-7
Chinese Library Classification
R8 [Special Medicine]; R445 [Diagnostic Imaging]
Discipline classification codes
1002; 100207; 1009
Abstract
Purpose: To assess the accuracy, reliability, and readability of publicly available large language models in answering fundamental questions on hepatocellular carcinoma diagnosis and management.

Methods: Twenty questions on liver cancer diagnosis and management were posed in triplicate to ChatGPT-3.5 (OpenAI), Gemini (Google), and Bing (Microsoft). Responses were assessed by six fellowship-trained physicians from three academic liver transplant centers who actively diagnose and/or treat liver cancer. Each response was categorized as accurate (score 1: all information is true and relevant), inadequate (score 0: all information is true but does not fully answer the question or provides irrelevant information), or inaccurate (score -1: any information is false). Means with standard deviations were recorded. A question's response was considered accurate overall if its mean score was > 0, and reliable if the mean score was > 0 for each of the three repeated responses to that question. Readability was quantified with the Flesch Reading Ease Score and the Flesch-Kincaid Grade Level. Readability and accuracy across each model's 60 responses were compared using one-way ANOVAs with Tukey's multiple comparison tests.

Results: Of the twenty questions, ChatGPT answered nine (45%) accurately, Gemini 12 (60%), and Bing six (30%); however, only six (30%), eight (40%), and three (15%), respectively, were both accurate and reliable. There were no significant differences in accuracy between the chatbots. ChatGPT responses were the least readable (mean Flesch Reading Ease Score 29; college-graduate level), followed by Gemini (30; college) and Bing (40; college; p < 0.001).

Conclusion: Large language models provide complex responses to basic questions on hepatocellular carcinoma diagnosis and management that are seldom accurate, reliable, or readable.
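To make the scoring and comparison procedure concrete, here is a minimal sketch in Python. It is not the authors' code: it assumes numpy, scipy, statsmodels, and the textstat package are available, and every score and readability value below is a randomly generated placeholder rather than study data.

# Minimal sketch (not the authors' code) of the scoring and readability
# pipeline described in the Methods; all inputs are random placeholders.
import numpy as np
import textstat
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
chatbots = ["ChatGPT-3.5", "Gemini", "Bing"]

# scores[bot] has shape (20 questions, 3 repetitions, 6 raters), with
# values in {-1, 0, 1} per the rubric (inaccurate/inadequate/accurate).
scores = {bot: rng.integers(-1, 2, size=(20, 3, 6)) for bot in chatbots}

for bot, s in scores.items():
    question_mean = s.mean(axis=(1, 2))         # mean over reps and raters
    response_mean = s.mean(axis=2)              # mean over raters only
    accurate = question_mean > 0                # overall accuracy criterion
    reliable = (response_mean > 0).all(axis=1)  # > 0 for all 3 repetitions
    print(f"{bot}: {accurate.sum()}/20 accurate, "
          f"{(accurate & reliable).sum()}/20 accurate and reliable")

# Readability of a single (placeholder) response.
reply = "Hepatocellular carcinoma is typically diagnosed with multiphase CT or MRI."
print(textstat.flesch_reading_ease(reply), textstat.flesch_kincaid_grade(reply))

# One-way ANOVA with Tukey's test across each model's 60 responses, using
# placeholder readability values centered on the reported means.
fre = {bot: rng.normal(loc=m, scale=5, size=60)
       for bot, m in zip(chatbots, (29, 30, 40))}
print(f_oneway(*fre.values()))
print(pairwise_tukeyhsd(np.concatenate(list(fre.values())),
                        np.repeat(chatbots, 60)))

The thresholding mirrors the Methods: a question counts as accurate when its mean over all 18 ratings (3 repetitions x 6 raters) exceeds 0, and as reliable only when each repetition's six-rater mean exceeds 0; pairwise_tukeyhsd prints the Tukey-adjusted pairwise comparisons of the kind that back the reported p < 0.001 readability difference.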
Pages: 4286-4294 (9 pages)