The reliability of freely accessible, baseline, general-purpose large language model generated patient information for frequently asked questions on liver disease: a preliminary cross-sectional study

Cited: 0
Authors
Niriella, Madunil A. [1 ]
Premaratna, Pathum [1 ]
Senanayake, Mananjala [2 ]
Kodisinghe, Senerath [3 ]
Dassanayake, Uditha [1 ]
Dassanayake, Anuradha [1 ]
Ediriweera, Dileepa S. [1 ]
de Silva, H. Janaka [1 ]
Affiliations
[1] Univ Kelaniya, Fac Med, Ragama, Sri Lanka
[2] Dist Gen Hosp, Gastroenterol Unit, Negombo, Sri Lanka
[3] Dist Gen Hosp, Gastroenterol Unit, Matara, Sri Lanka
Keywords
Artificial intelligence; large language model; AI; LLM; liver disease; patient information
DOI
10.1080/17474124.2025.2471874
Chinese Library Classification
R57 [Diseases of the digestive system and abdomen]
Subject Classification
Abstract
Background: We assessed the use of large language models (LLMs), such as ChatGPT-3.5 and Gemini, against human experts as sources of patient information.
Research design and methods: We compared the accuracy, completeness, and quality of freely accessible, baseline, general-purpose LLM-generated responses to 20 frequently asked questions (FAQs) on liver disease with those from two gastroenterologists, using the Kruskal-Wallis test. Three independent gastroenterologists blindly rated each response.
Results: The expert- and AI-generated responses displayed high mean scores across all domains, with no statistical difference between the groups for accuracy [H(2) = 0.421, p = 0.811], completeness [H(2) = 3.146, p = 0.207], or quality [H(2) = 3.350, p = 0.187]. We found no statistical difference in rank totals for accuracy [H(2) = 5.559, p = 0.062], completeness [H(2) = 0.104, p = 0.949], or quality [H(2) = 0.420, p = 0.810] between the three raters (R1, R2, R3).
Conclusion: Our findings outline the potential of freely accessible, baseline, general-purpose LLMs to provide reliable answers to FAQs on liver disease.
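The H statistics and p-values reported above come from the Kruskal-Wallis test, which compares independent groups by ranking the pooled observations. A minimal, self-contained sketch of the computation is shown below; the scores are hypothetical rating data for illustration only (not the study's data), and the tie correction applied by standard statistical packages is omitted for brevity.

```python
# Illustrative Kruskal-Wallis H statistic for k independent samples,
# as used in the study to compare rater scores across response sources.
# Hypothetical data; tie correction omitted for brevity.
def kruskal_wallis_h(*groups):
    """Return the Kruskal-Wallis H statistic (without tie correction)."""
    # Pool all observations, remembering which group each came from.
    pooled = sorted((value, gi) for gi, group in enumerate(groups)
                    for value in group)
    n_total = len(pooled)
    # Assign 1-based ranks, averaging the ranks of tied values.
    ranks = [0.0] * n_total
    i = 0
    while i < n_total:
        j = i
        while j + 1 < n_total and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    # Sum the ranks within each group.
    rank_sums = [0.0] * len(groups)
    for (_, gi), r in zip(pooled, ranks):
        rank_sums[gi] += r
    # H = 12/(N(N+1)) * sum(R_i^2 / n_i) - 3(N+1)
    return (12.0 / (n_total * (n_total + 1))) * sum(
        rs * rs / len(g) for rs, g in zip(rank_sums, groups)
    ) - 3 * (n_total + 1)

# Hypothetical accuracy ratings for two experts and one LLM:
h = kruskal_wallis_h([5, 5, 4, 5, 4], [4, 5, 5, 4, 5], [5, 4, 5, 5, 4])
```

The resulting H is compared against a chi-squared distribution with k − 1 degrees of freedom (here 2, matching the "H(2)" notation in the abstract) to obtain the p-value; a p-value above 0.05 indicates no significant difference between the groups, mirroring the study's findings.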
Pages: 437-442
Page count: 6