How does ChatGPT perform on the European Board of Pediatric Surgery examination? A randomized comparative study

Times Cited: 2
Authors
Azizoglu, Mustafa [1 ]
Aydogdu, Bahattin [2 ]
Affiliations
[1] Dicle Univ, Med Sch, Dept Pediat Surg, Diyarbakir, Turkiye
[2] Balikesir Univ, Dept Pediat Surg, Balikesir, Turkiye
Source
MEDICINA BALEAR | 2024, Vol. 39, Issue 01
Keywords
ChatGPT; Pediatric Surgery; exam; questions; artificial intelligence;
DOI
10.3306/AJHS.2024.39.01.23
Chinese Library Classification
R5 [Internal Medicine];
Discipline Code
1002; 100201;
Abstract
Purpose: The purpose of this study was to conduct a detailed comparison of the accuracy and responsiveness of GPT-3.5 and GPT-4 in the realm of pediatric surgery. Specifically, we sought to assess their ability to correctly answer a series of sample questions from the European Board of Pediatric Surgery (EBPS) examination. Methods: This study was conducted between 20 May 2023 and 30 May 2023 and undertook a comparative analysis of two AI language models, GPT-3.5 and GPT-4, in the field of pediatric surgery, specifically in the context of EBPS exam sample questions. Two sets of 105 EBPS sample questions each (210 in total) were collated. Results: In General Pediatric Surgery, GPT-3.5 provided correct answers for 7 questions (46.7%), while GPT-4 had a higher accuracy with 13 correct responses (86.7%) (p=0.020). For Newborn Surgery and Pediatric Urology, GPT-3.5 correctly answered 6 questions (40.0%), while GPT-4 correctly answered 12 (80.0%) (p=0.025). In total, GPT-3.5 correctly answered 46 of 105 questions (43.8%), whereas GPT-4 showed significantly better performance, correctly answering 80 questions (76.2%) (p<0.001). Across all responses, the odds ratio for GPT-4 versus GPT-3.5 was 4.1, indicating that GPT-4 was 4.1 times more likely to provide a correct answer to the pediatric surgery questions than GPT-3.5. Conclusion: This comparative study concludes that GPT-4 significantly outperforms GPT-3.5 in responding to EBPS exam questions.
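The reported odds ratio can be sanity-checked directly from the counts stated in the abstract (80/105 correct for GPT-4, 46/105 for GPT-3.5). This is a minimal sketch assuming the standard 2x2 contingency layout; it is not taken from the paper's own analysis code.

```python
# Reproduce the reported odds ratio from the abstract's totals.
# Counts come from the abstract; the 2x2 layout is an assumption.
gpt4_correct, gpt35_correct = 80, 46
total = 105

gpt4_wrong = total - gpt4_correct    # 25
gpt35_wrong = total - gpt35_correct  # 59

# Odds of a correct answer for GPT-4 vs. GPT-3.5:
# (correct/wrong for GPT-4) divided by (correct/wrong for GPT-3.5)
odds_ratio = (gpt4_correct * gpt35_wrong) / (gpt4_wrong * gpt35_correct)
print(f"OR = {odds_ratio:.1f}")  # OR = 4.1

# The stated accuracies also follow from the same counts.
print(f"{gpt35_correct / total:.1%}")  # 43.8%
print(f"{gpt4_correct / total:.1%}")   # 76.2%
```

The result matches the abstract's figure of 4.1, so the reported odds ratio is consistent with the raw counts.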
Pages: 23-26
Page Count: 4