Benchmarking Vision Capabilities of Large Language Models in Surgical Examination Questions

Cited: 0
Authors
Bereuter, Jean-Paul [1 ,2 ]
Geissler, Mark Enrik [1 ,2 ,3 ]
Klimova, Anna [2 ,4 ]
Steiner, Robert-Patrick [2 ,5 ]
Pfeiffer, Kevin [2 ,3 ]
Kolbinger, Fiona R. [1 ,2 ,6 ]
Wiest, Isabella C. [2 ,3 ,7 ]
Muti, Hannah Sophie [1 ,2 ,3 ,8 ]
Kather, Jakob Nikolas [2 ,3 ,8 ,9 ]
Affiliations
[1] TUD Dresden Univ Technol, Fac Med, Dept Visceral Thorac & Vasc Surg, Dresden, Germany
[2] TUD Dresden Univ Technol, Univ Hosp Carl Gustav Carus, Dresden, Germany
[3] TUD Dresden Univ Technol, Fac Med, Else Kroener Fresenius Ctr Digital Hlth, Dresden, Germany
[4] TUD Dresden Univ Technol, Inst Med Informat & Biometry, Fac Med, Dresden, Germany
[5] TUD Dresden Univ Technol, Inst Pharmacol & Toxicol, Fac Med, Dresden, Germany
[6] Purdue Univ, Weldon Sch Biomed Engn, W Lafayette, IN USA
[7] Heidelberg Univ, Med Fac Mannheim, Dept Med 2, Mannheim, Germany
[8] Univ Hosp Heidelberg, Natl Ctr Tumor Dis, Med Oncol, Heidelberg, Germany
[9] TUD Dresden Univ Technol, Fac Med, Dept Med, Dresden, Germany
Funding
US National Institutes of Health; European Research Council
Keywords
exam questions; large language models; vision language models; vision capabilities; PERFORMANCE; CHATGPT; MEDICINE; GPT-4;
DOI
10.1016/j.jsurg.2025.103442
Chinese Library Classification: G40 [Education]
Subject classification codes: 040101; 120403
Abstract
OBJECTIVE: Recent studies have investigated the potential of large language models (LLMs) for clinical decision making and for answering exam questions based on text input. Recent developments have extended these models with vision capabilities; such image-processing LLMs are called vision-language models (VLMs). However, there has been limited investigation of the applicability of VLMs and their ability to answer exam questions containing image content. The aim of this study was therefore to examine the performance of publicly accessible LLMs on 2 surgical question sets consisting of text and image questions.
DESIGN: Original text and image exam questions from 2 surgical question subsets from the German Medical Licensing Examination (GMLE) and the United States Medical Licensing Examination (USMLE) were collected and answered by publicly available LLMs (GPT-4, Claude-3 Sonnet, Gemini-1.5). LLM outputs were benchmarked for their accuracy on text and image questions. Additionally, LLM performance was compared to students' average historical performance (AHP) in these exams, and variations in LLM performance were analyzed in relation to question difficulty and image type.
RESULTS: All LLMs achieved scores equivalent to passing grades (>= 60%) on surgical text questions across both datasets. On image-based questions, only GPT-4 exceeded the score required to pass, significantly outperforming Claude-3 and Gemini-1.5 (GPT-4: 78% vs. Claude-3: 58% vs. Gemini-1.5: 57.3%; p < 0.001). Additionally, GPT-4 outperformed students on both text questions (GPT-4: 83.7% vs. AHP students: 67.8%; p < 0.001) and image questions (GPT-4: 78% vs. AHP students: 67.4%; p < 0.001).
CONCLUSION: GPT-4 demonstrated substantial capabilities in answering surgical text and image exam questions. It therefore holds considerable potential for use in surgical decision making and in the education of students and trainee surgeons.
(c) 2025 The Author(s). Published by Elsevier Inc. on behalf of Association of Program Directors in Surgery. This is an open access article under the CC BY license
Pages: 11