Analyzing Large Language Models' Responses to Common Lumbar Spine Fusion Surgery Questions: A Comparison Between ChatGPT and Bard

Cited by: 6
Authors
Lang, Siegmund Philipp [1 ,2 ]
Yoseph, Ezra Tilahun [1 ]
Gonzalez-Suarez, Aneysis D. [1 ]
Kim, Robert [1 ]
Fatemi, Parastou [3 ]
Wagner, Katherine [4 ]
Maldaner, Nicolai [1 ,5 ,6 ]
Stienen, Martin N. [7 ,8 ,9 ]
Zygourakis, Corinna Clio [1 ]
Affiliations
[1] Stanford Univ, Dept Neurosurg, Sch Med, Stanford, CA USA
[2] Univ Hosp Regensburg, Dept Trauma Surg, Regensburg, Germany
[3] Cleveland Clin, Dept Neurosurg, Cleveland, OH USA
[4] Ventura Neurosurg, Ventura, CA USA
[5] Univ Hosp Zurich, Dept Neurosurg, Zurich, Switzerland
[6] Univ Zurich, Clin Neurosci Ctr, Zurich, Switzerland
[7] Cantonal Hosp St Gallen, Dept Neurosurg, St Gallen, Switzerland
[8] Cantonal Hosp, Spine Ctr Eastern Switzerland, St Gallen, Switzerland
[9] Med Sch St Gallen, St Gallen, Switzerland
Keywords
Artificial intelligence; Large language models; Patient education; Lumbar spine fusion; ChatGPT; Bard; Complication
DOI
10.14245/ns.2448098.049
Chinese Library Classification
R74 [Neurology and Psychiatry]
Abstract
Objective: In the digital age, patients turn to online sources for information about lumbar spine fusion, necessitating careful study of large language models (LLMs) such as Chat Generative Pre-trained Transformer (ChatGPT) for patient education.
Methods: Our study aims to assess the response quality of OpenAI's ChatGPT 3.5 and Google's Bard to patient questions on lumbar spine fusion surgery. We identified 10 critical questions from 158 frequently asked ones via Google search and presented them to both chatbots. Five blinded spine surgeons rated the responses on a 4-point scale from 'unsatisfactory' to 'excellent.' The clarity and professionalism of the answers were also evaluated on a 5-point Likert scale.
Results: In our evaluation of 10 questions across ChatGPT 3.5 and Bard, 97% of responses were rated as excellent or satisfactory. Specifically, ChatGPT had 62% excellent and 32% minimally clarifying responses, with only 6% needing moderate or substantial clarification. Bard's responses were 66% excellent and 24% minimally clarifying, with 10% requiring more clarification. No significant difference was found in the overall rating distribution between the 2 models. Both struggled with 3 specific questions regarding surgical risks, success rates, and selection of surgical approaches (Q3, Q4, and Q5). Interrater reliability was low for both models (ChatGPT: κ = 0.041, p = 0.622; Bard: κ = -0.040, p = 0.601). While both scored well on understanding and empathy, Bard received marginally lower ratings in empathy and professionalism.
Conclusion: ChatGPT 3.5 and Bard effectively answered lumbar spine fusion FAQs, but further training and research are needed to solidify LLMs' role in medical education and healthcare communication.
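The reported interrater reliability is a Fleiss' kappa computed across the five surgeon raters. As a minimal sketch of how such a statistic can be obtained, the Python snippet below applies statsmodels' fleiss_kappa to a hypothetical 10-question-by-5-rater matrix; the rating values are invented for illustration and are not the study's data.

# Minimal sketch (not the authors' code): Fleiss' kappa for five raters
# scoring ten chatbot responses. The ratings matrix below is hypothetical.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows = 10 chatbot responses, columns = 5 blinded raters.
# Ratings use a 4-point scale, e.g.:
# 1 = excellent, 2 = minimal clarification needed,
# 3 = moderate clarification, 4 = substantial clarification.
ratings = np.array([
    [1, 1, 2, 1, 1],
    [1, 2, 1, 1, 2],
    [2, 3, 2, 4, 3],  # e.g., a harder question such as surgical risks
    [1, 1, 1, 2, 1],
    [3, 2, 3, 2, 4],
    [1, 1, 1, 1, 2],
    [2, 1, 1, 1, 1],
    [1, 2, 1, 1, 1],
    [1, 1, 2, 1, 1],
    [2, 1, 1, 2, 1],
])

# Convert the subjects-by-raters matrix into subjects-by-categories counts,
# then compute Fleiss' kappa across the five raters.
table, _ = aggregate_raters(ratings)
kappa = fleiss_kappa(table, method='fleiss')
print(f"Fleiss' kappa: {kappa:.3f}")  # values near 0 indicate chance-level agreement

Kappa values near zero, as reported for both chatbots, indicate agreement no better than chance, which is why the authors flag the low interrater reliability alongside the otherwise favorable ratings.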
Pages: 633-641 (9 pages)