Objective
To evaluate and compare the accuracy of publicly available large language models (LLMs), namely ChatGPT-4.0, ChatGPT-3.5, and Google Bard, in answering multiple-choice postgraduate-level surgical examination questions.

Methods
PubMed/MEDLINE and the Cochrane Library were searched for studies that compared the accuracy of ChatGPT and Google Bard on multiple-choice postgraduate-level surgical examination questions. A random-effects model was used to estimate and compare the pooled accuracy of the LLMs, with results reported with 95% confidence intervals (CIs). Heterogeneity was assessed using the I² statistic, and publication bias was evaluated through funnel plots and Egger's test. Statistical significance was set at P < 0.05.

Results
The full texts of 12 studies published between 2023 and 2024 were reviewed, and data were extracted to compare the performance of ChatGPT (GPT-3.5 or GPT-4.0) with that of Google Bard (since rebranded as Gemini). ChatGPT-4.0 exhibited the highest pooled accuracy, at 73% (95% CI: 65%-80%, P < 0.01, I² = 94%). No statistically significant difference was observed between ChatGPT-3.5 and Google Bard (OR: 0.98, 95% CI: 0.80-1.21, P = 0.88, I² = 67%), indicating that the two models performed at a similar level. A statistically significant difference was found between ChatGPT-4.0 and Google Bard (OR: 2.25, 95% CI: 1.73-2.91, P < 0.01, I² = 78%), demonstrating the superior performance of ChatGPT-4.0.

Conclusion
This meta-analysis highlights the strong potential of LLMs to pass postgraduate-level surgical examinations. Of the three LLMs, ChatGPT-4.0 demonstrated the highest accuracy, while Google Bard showed the most inconsistent performance, scoring below 50% in 4 of the 12 studies analyzed. These findings suggest that LLMs, particularly ChatGPT-4.0, can apply surgical knowledge to problem solving, with potential future applications in medical education and patient care.
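
As a purely illustrative aside, the sketch below shows how a random-effects pooled accuracy and I² estimate of the kind reported above might be computed. It uses the DerSimonian-Laird estimator on the log-odds scale; the per-study counts, the estimator choice, and all variable names are assumptions for illustration, not the review's actual data or analysis software.

```python
# Minimal sketch of DerSimonian-Laird random-effects pooling on the
# log-odds scale, with an I^2 heterogeneity estimate. The per-study
# counts below are hypothetical placeholders, not data from the review.
import numpy as np

# Hypothetical (correct, incorrect) answer counts per study.
studies = [(120, 40), (85, 55), (200, 90), (60, 50)]

# Per-study log-odds of a correct answer and their variances.
y = np.array([np.log(c / i) for c, i in studies])
v = np.array([1 / c + 1 / i for c, i in studies])

# Fixed-effect (inverse-variance) weights and Cochran's Q.
w = 1 / v
y_fixed = np.sum(w * y) / np.sum(w)
q = np.sum(w * (y - y_fixed) ** 2)
df = len(studies) - 1

# DerSimonian-Laird between-study variance (tau^2), floored at zero.
c_term = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (q - df) / c_term)

# Random-effects weights, pooled log-odds, and 95% CI.
w_re = 1 / (v + tau2)
y_re = np.sum(w_re * y) / np.sum(w_re)
se_re = np.sqrt(1 / np.sum(w_re))
ci_lo, ci_hi = y_re - 1.96 * se_re, y_re + 1.96 * se_re

# I^2: proportion of total variability due to between-study heterogeneity.
i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

# Back-transform pooled log-odds to a pooled proportion (accuracy).
def expit(x):
    return 1 / (1 + np.exp(-x))

print(f"Pooled accuracy: {expit(y_re):.2f} "
      f"(95% CI {expit(ci_lo):.2f}-{expit(ci_hi):.2f}), I^2 = {i2:.0f}%")
```

The back-transformation via the logistic (expit) function converts the pooled log-odds and its CI limits into the percentage-accuracy scale used in the results above; an odds ratio comparing two models would instead be pooled from per-study 2x2 tables.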