Doctor Versus Artificial Intelligence: Patient and Physician Evaluation of Large Language Model Responses to Rheumatology Patient Questions in a Cross-Sectional Study

Cited by: 15
Authors
Ye, Carrie [1 ]
Zweck, Elric [2 ]
Ma, Zechen [1 ]
Smith, Justin [1 ]
Katz, Steven [1 ]
Affiliations
[1] Univ Alberta, Edmonton, AB, Canada
[2] Univ Hosp Dusseldorf, Dusseldorf, Germany
Keywords
IMPACT;
DOI
10.1002/art.42737
Chinese Library Classification (CLC)
R5 [Internal Medicine];
Discipline codes
1002; 100201
Abstract
Objective: To assess the quality of large language model (LLM) chatbot responses versus physician-generated responses to patient-generated rheumatology questions.

Methods: We conducted a single-center cross-sectional survey of rheumatology patients (n = 17) in Edmonton, Alberta, Canada. Patients evaluated LLM chatbot versus physician-generated responses for comprehensiveness and readability; four rheumatologists additionally evaluated accuracy. All ratings used a Likert scale from 1 to 10 (1 = poor, 10 = excellent).

Results: Patients rated no significant difference between artificial intelligence (AI)-generated and physician-generated responses in comprehensiveness (mean 7.12 ± SD 0.99 vs 7.52 ± 1.16; P = 0.1962) or readability (7.90 ± 0.90 vs 7.80 ± 0.75; P = 0.5905). Rheumatologists rated AI responses significantly lower than physician responses on comprehensiveness (AI 5.52 ± 2.13 vs physician 8.76 ± 1.07; P < 0.0001), readability (AI 7.85 ± 0.92 vs physician 8.75 ± 0.57; P = 0.0003), and accuracy (AI 6.48 ± 2.07 vs physician 9.08 ± 0.64; P < 0.0001). The proportion of questions for which the AI-generated response was preferred over the physician-generated response was 0.45 ± 0.18 among patients and 0.15 ± 0.08 among physicians (P = 0.0106). After learning that one answer to each question was AI generated, patients correctly identified the AI-generated answers at a lower proportion than physicians did (0.49 ± 0.26 vs 0.97 ± 0.04; P = 0.0183). AI answers averaged 69.10 ± 25.35 words, compared with 98.83 ± 34.58 words for physician-generated responses (P = 0.0008).

Conclusion: Rheumatology patients rated AI-generated responses to patient questions similarly to physician-generated responses in comprehensiveness, readability, and overall preference. However, rheumatologists rated AI responses significantly lower than physician-generated responses, suggesting that LLM chatbot responses are inferior to physician responses, a difference that patients may not be aware of.
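For readers who want to see how a group comparison of this kind is computed, the sketch below tests a difference in mean Likert ratings in Python. It is a minimal illustration only: the abstract does not state which statistical test the authors used, so the choice of an independent two-sample t-test, the sample size, and the rating vectors are all assumptions; the data are simulated stand-ins, not study data.

# Minimal sketch (not the authors' actual analysis): comparing mean
# 1-10 Likert ratings for AI vs physician answers with an independent
# two-sample t-test. The abstract does not name the test used, so this
# choice, the sample size, and the simulated data are all assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated rheumatologist accuracy ratings, drawn to roughly match
# the reported means/SDs (AI 6.48 +/- 2.07 vs physician 9.08 +/- 0.64).
ai_accuracy = rng.normal(loc=6.48, scale=2.07, size=30)
physician_accuracy = rng.normal(loc=9.08, scale=0.64, size=30)

t_stat, p_value = stats.ttest_ind(ai_accuracy, physician_accuracy)
print(f"AI mean: {ai_accuracy.mean():.2f} +/- {ai_accuracy.std(ddof=1):.2f}")
print(f"Physician mean: {physician_accuracy.mean():.2f} +/- {physician_accuracy.std(ddof=1):.2f}")
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")

With means roughly 2.6 points apart and modest spreads, the test yields a very small p-value, consistent in direction with the P < 0.0001 the abstract reports; a paired test on per-question ratings would be a reasonable alternative if the same raters scored both answer types.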
Pages: 479-484
Page count: 6