Comparing Vision-Capable Models, GPT-4 and Gemini, With GPT-3.5 on Taiwan's Pulmonologist Exam

Cited by: 1
Authors
Chen, Chih-Hsiung [1 ]
Hsieh, Kuang-Yu [1 ]
Huang, Kuo-En [1 ]
Lai, Hsien-Yun [2 ]
Affiliations
[1] Mennonite Christian Hosp, Dept Crit Care Med, Hualien, Taiwan
[2] Mennonite Christian Hosp, Dept Educ & Res, Hualien, Taiwan
Keywords
vision feature; pulmonologist exam; gemini; gpt; large language models; artificial intelligence;
DOI
10.7759/cureus.67641
Chinese Library Classification (CLC): R5 [Internal Medicine]
Discipline codes: 1002; 100201
Abstract
Introduction: The latest generation of large language models (LLMs) features multimodal capabilities, allowing them to interpret graphics, images, and videos, which are crucial in medical fields. This study investigates the vision capabilities of the next-generation Generative Pre-trained Transformer 4 (GPT-4) and Google's Gemini.

Methods: To establish a comparative baseline, we used GPT-3.5, a model limited to text processing, and evaluated the performance of all three models on questions from the Taiwan Specialist Board Exams in Pulmonary and Critical Care Medicine. Our dataset comprised 1,100 questions from 2013 to 2023, with 100 questions per year. Of these, 1,059 were pure text and 41 combined text with images; the majority were in a non-English language, with only six in pure English.

Results: Across the annual exams of 100 questions each from 2013 to 2023, GPT-4 scored 66, 69, 51, 64, 72, 64, 66, 64, 63, 68, and 67, respectively. Gemini scored 45, 48, 45, 45, 46, 59, 54, 41, 53, 45, and 45, while GPT-3.5 scored 39, 33, 35, 36, 32, 33, 43, 28, 32, 33, and 36.

Conclusions: These results demonstrate that the newer LLMs with vision capabilities significantly outperform the text-only model. With the passing score set at 60, GPT-4 passed most exams and approached human performance.
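As a quick sanity check on the reported numbers, the per-year scores from the abstract can be tallied against the stated passing threshold of 60 in a short script (the variable names below are ours, not the paper's):

```python
# Per-year exam scores (2013-2023) as reported in the abstract.
scores = {
    "GPT-4":   [66, 69, 51, 64, 72, 64, 66, 64, 63, 68, 67],
    "Gemini":  [45, 48, 45, 45, 46, 59, 54, 41, 53, 45, 45],
    "GPT-3.5": [39, 33, 35, 36, 32, 33, 43, 28, 32, 33, 36],
}
PASS = 60  # passing threshold used in the study

for model, s in scores.items():
    mean = sum(s) / len(s)
    passed = sum(1 for x in s if x >= PASS)
    print(f"{model}: mean {mean:.1f}, passed {passed}/{len(s)} exams")
```

Tallied this way, GPT-4 passes 10 of the 11 annual exams (all but the 51 in 2015), while Gemini and GPT-3.5 pass none, which matches the conclusion that only GPT-4 approached human-level performance.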
Pages: 9