Performance of ChatGPT versus Google Bard on Answering Postgraduate-Level Surgical Examination Questions: A Meta-Analysis

Cited: 1
Authors
Andrew, Albert [1 ]
Zhao, Sunny [1 ]
Affiliations
[1] Univ Auckland, Fac Med & Hlth Sci, Sch Med, Auckland, New Zealand
Keywords
Large language models; ChatGPT; Google Bard; Examination performance; Meta-analysis
DOI
10.1007/s12262-025-04296-x
Chinese Library Classification
R61 [Operative surgery]
Abstract
Objective: To evaluate and compare the accuracy of publicly available large language models (ChatGPT-4.0, ChatGPT-3.5, and Google Bard) in answering multiple-choice postgraduate-level surgical examination questions.
Methods: PubMed/MEDLINE and the Cochrane Library were searched for studies comparing the accuracy of ChatGPT and Google Bard on multiple-choice postgraduate-level surgical examination questions. A random-effects model was used to estimate and compare the pooled accuracy of the models, with results reported with 95% confidence intervals (CI). Heterogeneity was assessed using the I2 statistic, and publication bias was evaluated through funnel plots and Egger's test. Statistical significance was set at P < 0.05.
Results: The full texts of 12 studies published between 2023 and 2024 were reviewed, and data were extracted to compare the performance of ChatGPT (GPT-3.5 or GPT-4.0) with Google Bard (since rebranded as Gemini). ChatGPT-4.0 exhibited the highest pooled accuracy, at 73% (95% CI: 0.65-0.80, P < 0.01, I2 = 94%). No statistically significant difference was observed between ChatGPT-3.5 and Google Bard (OR: 0.98, 95% CI: 0.80-1.21, P = 0.88, I2 = 67%), indicating that the two models performed at a similar level. A statistically significant difference was found between ChatGPT-4.0 and Google Bard (OR: 2.25, 95% CI: 1.73-2.91, P < 0.01, I2 = 78%), showing that ChatGPT-4.0 performed better.
Conclusion: This meta-analysis highlights the strong potential of large language models to pass postgraduate-level surgical examinations. Of the three models, ChatGPT-4.0 demonstrated the best accuracy, while Google Bard showed the most inconsistent performance, scoring under 50% in 4 of the 12 studies analyzed. These findings suggest that large language models, particularly ChatGPT-4.0, can apply surgical knowledge to solve problems, with potential future applications in medical education and patient care.
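The random-effects pooling described in the Methods can be sketched as follows: a minimal DerSimonian-Laird implementation in Python that pools per-study effects (e.g. log odds ratios), estimates between-study variance, and reports the I2 heterogeneity statistic. This is an illustrative reconstruction, not the authors' analysis code, and the study inputs below are hypothetical.

```python
import math

def dersimonian_laird(effects, variances):
    """Pool per-study effects (e.g. log odds ratios) with a
    DerSimonian-Laird random-effects model.

    Returns (pooled_effect, (ci_low, ci_high), i2_percent)."""
    k = len(effects)
    # Fixed-effect (inverse-variance) weights and pooled estimate
    w = [1.0 / v for v in variances]
    fixed = sum(wi * y for wi, y in zip(w, effects)) / sum(w)
    # Cochran's Q measures observed between-study dispersion
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, effects))
    df = k - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)  # between-study variance estimate
    # Random-effects weights add tau2 to each within-study variance
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * y for wi, y in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    ci = (pooled - 1.96 * se, pooled + 1.96 * se)  # 95% CI
    # I2: share of total variability due to heterogeneity, in percent
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled, ci, i2

# Hypothetical per-study log odds ratios and their variances
pooled, ci, i2 = dersimonian_laird([0.5, 0.9, 0.2, 1.1],
                                   [0.04, 0.09, 0.06, 0.05])
```

Exponentiating `pooled` and the CI bounds would yield a pooled odds ratio of the kind reported in the Results (e.g. OR 2.25, 95% CI 1.73-2.91).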
Pages: 10