Evaluating ChatGPT Responses on Thyroid Nodules for Patient Education

Cited: 32
Authors
Campbell, Daniel J. [1,2]
Estephan, Leonard E. [1 ]
Sina, Elliott M. [1 ]
Mastrolonardo, Eric V. [1 ]
Alapati, Rahul [1 ]
Amin, Dev R. [1 ]
Cottrill, Elizabeth E. [1 ]
Affiliations
[1] Thomas Jefferson Univ Hosp, Dept Otolaryngol Head & Neck Surg, Philadelphia, PA USA
[2] Thomas Jefferson Univ Hosp, Dept Otolaryngol Head & Neck Surg, 925 Chestnut St,Floor 6, Philadelphia, PA 19107 USA
Keywords
thyroid nodule; artificial intelligence; patient education; ChatGPT
DOI
10.1089/thy.2023.0491
Chinese Library Classification (CLC)
R5 [Internal Medicine]
Discipline Codes
1002; 100201
Abstract
Background: ChatGPT, an artificial intelligence (AI) chatbot, is the fastest-growing consumer application in history. Given recent trends showing increasing patient use of Internet sources for self-education, we sought to evaluate the quality of ChatGPT-generated responses for patient education on thyroid nodules.

Methods: ChatGPT was queried 4 times with 30 identical questions. Queries differed by initial chatbot prompting: no prompting, patient-friendly prompting, 8th-grade-level prompting, and prompting for references. Answers were scored on a hierarchical scale: incorrect, partially correct, correct, or correct with references. Proportions of responses at incremental score thresholds were compared by prompt type using chi-squared analysis. The Flesch-Kincaid grade level was calculated for each answer, and the relationship between prompt type and grade level was assessed using analysis of variance. References provided within ChatGPT answers were totaled and analyzed for veracity.

Results: Across all prompts (n = 120 questions), 83 answers (69.2%) were at least correct. Proportions of responses that were at least partially correct (p = 0.795) and at least correct (p = 0.402) did not differ by prompt type; proportions that were correct with references did (p < 0.0001). Responses from 8th-grade-level prompting had the lowest mean grade level (13.43 ± 2.86), significantly lower than no prompting (14.97 ± 2.01, p = 0.01) and prompting for references (16.43 ± 2.05, p < 0.0001). Prompting for references generated referenced medical publications within 80/80 (100%) of answers. Seventy references (87.5%) were legitimate citations, and 58/80 (72.5%) accurately reported information from the referenced publication.

Conclusion: Overall, ChatGPT provides appropriate answers to most questions on thyroid nodules regardless of prompting. Despite targeted prompting strategies, however, ChatGPT reliably generates responses at grade levels well above accepted recommendations for presenting medical information to patients. Significant rates of AI hallucination may preclude clinicians from recommending the current version of ChatGPT as an educational tool for patients at this time.
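The readability metric used in the Methods can be reproduced with the standard Flesch-Kincaid grade-level formula: 0.39 × (words/sentences) + 11.8 × (syllables/words) − 15.59. A minimal sketch follows; the paper does not specify its implementation, and the syllable counter here is a simple vowel-group heuristic (an assumption), so scores may differ slightly from tools the authors used.

```python
import re

def count_syllables(word: str) -> int:
    # Heuristic (assumption): count vowel groups, trimming a silent trailing "e".
    word = word.lower()
    if word.endswith("e") and not word.endswith(("le", "ee")):
        word = word[:-1]
    groups = re.findall(r"[aeiouy]+", word)
    return max(1, len(groups))

def flesch_kincaid_grade(text: str) -> float:
    # Standard Flesch-Kincaid grade-level formula:
    # 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (len(words) / len(sentences))
            + 11.8 * (syllables / len(words))
            - 15.59)
```

A score of 13.43, the lowest mean reported above, corresponds roughly to college-freshman reading level, well above the 6th-to-8th-grade level commonly recommended for patient materials.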
Pages: 371-377 (7 pages)