Chat generative pre-trained transformer's performance on dermatology-specific questions and its implications in medical education

Times Cited: 2
Authors
Behrmann, James [1 ]
Hong, Ellen M. [1 ]
Meledathu, Shannon [1 ]
Leiter, Aliza [1 ]
Povelaitis, Michael [1 ]
Mitre, Mariela [1 ,2 ]
Affiliations
[1] Hackensack Meridian Health, Hackensack Meridian School of Medicine, 123 Metro Blvd, Nutley, NJ 07110, USA
[2] Hackensack University Medical Center, Department of Medicine, Division of Dermatology, Hackensack, NJ, USA
Keywords
Artificial intelligence (AI); dermatology; chat generative pre-trained transformer (ChatGPT)
DOI
10.21037/jmai-23-47
Chinese Library Classification (CLC)
TP18 (Artificial Intelligence Theory)
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Background: Large language models (LLMs) such as chat generative pre-trained transformer (ChatGPT) have gained popularity in healthcare by performing at or near the passing threshold for the United States Medical Licensing Examination (USMLE), but their limitations warrant consideration. Dermatology is a specialized medical field that relies heavily on visual recognition and images for diagnosis. This paper aimed to measure ChatGPT's ability to answer dermatology questions and to compare this sub-specialty accuracy with its overall scores on USMLE Step exams.

Methods: A total of 492 dermatology-related questions from Amboss were separated by their corresponding medical licensing exam (Step 1 =160, Step 2CK =171, and Step 3 =161). The question stem and answer choices were input into ChatGPT, and the answer, question difficulty, and presence of an image omitted from the prompt were recorded for each question. Results were calculated and compared against the estimated 60% passing standard.

Results: ChatGPT answered 41% of all questions correctly (Step 1 =41%, Step 2CK =38%, and Step 3 =46%). There was no significant difference in ChatGPT's ability to answer questions that originally contained an image versus those that did not [P=0.205; 95% confidence interval (95% CI): 0.00 to 0.15], but it scored significantly lower than the estimated 60% passing standard for the USMLE exams (P=0.008; 95% CI: -0.29 to -0.08). Analyzing questions by difficulty level demonstrated a skewed distribution, with easier-rated [...]

Conclusions: Our findings demonstrate that ChatGPT answered fewer dermatology-specific questions correctly than its overall performance on the USMLE (41% and 60%, respectively). Interestingly, ChatGPT scored similarly whether or not the question had an associated image, which may provide insight into how it uses its knowledge base to select answer choices. Using ChatGPT in conjunction with deep learning systems that include image analysis may improve accuracy and provide a more robust educational tool in dermatology.
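The comparison of ChatGPT's 41% accuracy against the estimated 60% passing standard is, in effect, a one-sample proportion test. The Python sketch below is not taken from the paper; it assumes roughly 202 of the 492 questions were answered correctly (consistent with the reported 41%) and shows one way such a comparison could be run with SciPy's exact binomial test, which may differ from the statistical method the authors actually used.

from scipy.stats import binomtest

# Counts inferred from the reported percentages; illustrative only, not the study's raw data.
n_questions = 492         # dermatology-related Amboss questions in total
n_correct = 202           # ~41% of 492, per the reported overall accuracy
passing_threshold = 0.60  # estimated USMLE passing standard used for comparison

# Exact binomial test of the observed proportion against the 60% threshold.
result = binomtest(n_correct, n_questions, p=passing_threshold, alternative="two-sided")
ci = result.proportion_ci(confidence_level=0.95)

print(f"Observed accuracy: {n_correct / n_questions:.1%}")
print(f"Exact binomial p-value vs. 60% threshold: {result.pvalue:.3g}")
print(f"95% CI for the observed proportion: ({ci.low:.1%}, {ci.high:.1%})")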
Pages: 8