Large language models' responses to liver cancer surveillance, diagnosis, and management questions: accuracy, reliability, readability

Cited by: 11
Authors
Cao, Jennie J. [1 ]
Kwon, Daniel H. [2 ]
Ghaziani, Tara T. [3 ]
Kwo, Paul [3 ]
Tse, Gary [4 ]
Kesselman, Andrew [5 ]
Kamaya, Aya [1 ]
Tse, Justin R. [1 ]
Affiliations
[1] Stanford Univ, Sch Med, Dept Radiol, 300 Pasteur Dr, Room H-1307, Stanford, CA 94305 USA
[2] Univ Calif San Francisco, Sch Med, Dept Med, 505 Parnassus Ave, MC1286C, San Francisco, CA 94144 USA
[3] Stanford Univ, Sch Med, Dept Med, 430 Broadway St, MC 6341, Redwood City, CA 94063 USA
[4] Univ Calif Los Angeles, David Geffen Sch Med, Dept Radiol Sci, 757 Westwood Plaza, Los Angeles, CA 90095 USA
[5] Stanford Univ, Sch Med, Dept Urol, 875 Blake Wilbur Dr, Palo Alto, CA 94304 USA
Keywords
Liver cancer; Hepatocellular carcinoma; Artificial intelligence; Large language model; ChatGPT
DOI
10.1007/s00261-024-04501-7
CLC Classification
R8 [Special Medicine]; R445 [Diagnostic Imaging]
Subject Classification Codes
1002; 100207; 1009
Abstract
Purpose: To assess the accuracy, reliability, and readability of publicly available large language models in answering fundamental questions on hepatocellular carcinoma diagnosis and management.
Methods: Twenty questions on liver cancer diagnosis and management were posed in triplicate to ChatGPT-3.5 (OpenAI), Gemini (Google), and Bing (Microsoft). Responses were assessed by six fellowship-trained physicians from three academic liver transplant centers who actively diagnose and/or treat liver cancer. Each response was categorized as accurate (score 1: all information is true and relevant), inadequate (score 0: all information is true but does not fully answer the question or provides irrelevant information), or inaccurate (score −1: any information is false). Means with standard deviations were recorded. A response was considered accurate as a whole if its mean score was > 0, and reliable if the mean score was > 0 across all three responses to the same question. Readability was quantified using the Flesch Reading Ease Score and Flesch-Kincaid Grade Level. Readability and accuracy across the 60 responses were compared using one-way ANOVAs with Tukey's multiple comparison tests.
Results: Of the twenty questions, ChatGPT answered 9 (45%), Gemini 12 (60%), and Bing 6 (30%) accurately; however, only 6 (30%), 8 (40%), and 3 (15%), respectively, were both accurate and reliable. There were no significant differences in accuracy among the chatbots. ChatGPT responses were the least readable (mean Flesch Reading Ease Score 29; college graduate level), followed by Gemini (30; college) and Bing (40; college; p < 0.001).
Conclusion: Large language models provide complex responses to basic questions on hepatocellular carcinoma diagnosis and management that are seldom accurate, reliable, or readable.
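The grading rule and readability metrics in the Methods are simple enough to state precisely in code. The following Python sketch shows one way they could be implemented; the function names and data layout are illustrative assumptions, not from the paper, while the 1/0/−1 scale and > 0 thresholds come from the abstract and the Flesch formulas are the standard published ones.

    # Illustrative sketch; names and data layout are assumptions, not the
    # authors' actual analysis code.
    from statistics import mean

    def classify_question(physician_scores):
        """physician_scores: three lists (one per repeated response), each
        holding the six physician scores of 1 (accurate), 0 (inadequate),
        or -1 (inaccurate). Returns (accurate_overall, reliable) per the
        abstract's criteria."""
        response_means = [mean(scores) for scores in physician_scores]
        accurate = mean(response_means) > 0            # mean score > 0 overall
        reliable = all(m > 0 for m in response_means)  # > 0 for every repeat
        return accurate, reliable

    def flesch_reading_ease(words, sentences, syllables):
        # Standard Flesch Reading Ease formula; lower scores mean harder text
        # (scores near 30, as reported here, indicate college-level reading).
        return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

    def flesch_kincaid_grade(words, sentences, syllables):
        # Standard Flesch-Kincaid Grade Level formula (U.S. school grade).
        return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

    # Example: one question, three repeated responses, six graders each.
    print(classify_question([[1, 1, 0, 1, 1, 1],
                             [1, 0, 1, 1, 1, 0],
                             [1, 1, 1, 1, 0, 1]]))  # -> (True, True)

Note that under these criteria a single poorly scored repeat makes a question unreliable even when its average response is accurate, which is consistent with the gap the abstract reports between accuracy alone (30-60% of questions) and combined accuracy and reliability (15-40%).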
Pages: 4286-4294
Number of pages: 9