Analyzing Large Language Models' Responses to Common Lumbar Spine Fusion Surgery Questions: A Comparison Between ChatGPT and Bard

Cited by: 6
Authors
Lang, Siegmund Philipp [1 ,2 ]
Yoseph, Ezra Tilahun [1 ]
Gonzalez-Suarez, Aneysis D. [1 ]
Kim, Robert [1 ]
Fatemi, Parastou [3 ]
Wagner, Katherine [4 ]
Maldaner, Nicolai [1 ,5 ,6 ]
Stienen, Martin N. [7 ,8 ,9 ]
Zygourakis, Corinna Clio [1 ]
Affiliations
[1] Stanford Univ, Dept Neurosurg, Sch Med, Stanford, CA USA
[2] Univ Hosp Regensburg, Dept Trauma Surg, Regensburg, Germany
[3] Cleveland Clin, Dept Neurosurg, Cleveland, OH USA
[4] Ventura Neurosurg, Ventura, CA USA
[5] Univ Hosp Zurich, Dept Neurosurg, Zurich, Switzerland
[6] Univ Zurich, Clin Neurosci Ctr, Zurich, Switzerland
[7] Cantonal Hosp St Gallen, Dept Neurosurg, St Gallen, Switzerland
[8] Cantonal Hosp, Spine Ctr Eastern Switzerland, St Gallen, Switzerland
[9] Med Sch St Gallen, St Gallen, Switzerland
Keywords
Artificial intelligence; Large language models; Patient education; Lumbar spine fusion; ChatGPT; Bard; Complication
DOI
10.14245/ns.2448098.049
Chinese Library Classification
R74 [Neurology and Psychiatry]
Abstract
Objective: In the digital age, patients turn to online sources for information about lumbar spine fusion, necessitating careful study of large language models (LLMs) such as Chat Generative Pre-trained Transformer (ChatGPT) for patient education.
Methods: Our study aims to assess the response quality of OpenAI's ChatGPT 3.5 and Google's Bard to patient questions on lumbar spine fusion surgery. We identified 10 critical questions from 158 frequently asked ones via Google search and presented them to both chatbots. Five blinded spine surgeons rated the responses on a 4-point scale from 'unsatisfactory' to 'excellent.' The clarity and professionalism of the answers were also evaluated on a 5-point Likert scale.
Results: In our evaluation of 10 questions across ChatGPT 3.5 and Bard, 97% of responses were rated as excellent or satisfactory. Specifically, ChatGPT had 62% excellent and 32% minimally clarifying responses, with only 6% needing moderate or substantial clarification. Bard's responses were 66% excellent and 24% minimally clarifying, with 10% requiring more clarification. No significant difference was found in the overall rating distribution between the 2 models. Both struggled with 3 specific questions regarding surgical risks, success rates, and selection of surgical approaches (Q3, Q4, and Q5). Interrater reliability was low for both models (ChatGPT: κ = 0.041, p = 0.622; Bard: κ = -0.040, p = 0.601). While both scored well on understanding and empathy, Bard received marginally lower ratings in empathy and professionalism.
Conclusion: ChatGPT 3.5 and Bard effectively answered lumbar spine fusion FAQs, but further training and research are needed to solidify LLMs' role in medical education and healthcare communication.
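The reported interrater reliability is a Fleiss' kappa computed across the five surgeon raters. As a minimal sketch of how such a statistic can be obtained, the Python snippet below applies statsmodels' fleiss_kappa to a hypothetical 10-question-by-5-rater matrix; the rating values are invented for illustration and are not the study's data.

# Minimal sketch (not the authors' code): Fleiss' kappa for five raters
# scoring ten chatbot responses. The ratings matrix below is hypothetical.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows = 10 chatbot responses, columns = 5 blinded raters.
# Ratings use a 4-point scale, e.g.:
# 1 = excellent, 2 = minimal clarification needed,
# 3 = moderate clarification, 4 = substantial clarification.
ratings = np.array([
    [1, 1, 2, 1, 1],
    [1, 2, 1, 1, 2],
    [2, 3, 2, 4, 3],  # e.g., a harder question such as surgical risks
    [1, 1, 1, 2, 1],
    [3, 2, 3, 2, 4],
    [1, 1, 1, 1, 2],
    [2, 1, 1, 1, 1],
    [1, 2, 1, 1, 1],
    [1, 1, 2, 1, 1],
    [2, 1, 1, 2, 1],
])

# Convert the subjects-by-raters matrix into subjects-by-categories counts,
# then compute Fleiss' kappa across the five raters.
table, _ = aggregate_raters(ratings)
kappa = fleiss_kappa(table, method='fleiss')
print(f"Fleiss' kappa: {kappa:.3f}")  # values near 0 indicate chance-level agreement

Kappa values near zero, as reported for both chatbots, indicate agreement no better than chance, which is why the authors flag the low interrater reliability alongside the otherwise favorable ratings.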
Pages: 633-641 (9 pages)