Capabilities of GPT-4 in ophthalmology: an analysis of model entropy and progress towards human-level medical question answering

Cited by: 40
Authors
Antaki, Fares [1 ,2 ,3 ,4 ,5 ]
Milad, Daniel [4 ,5 ,6 ]
Chia, Mark A. [1 ,2 ]
Giguere, Charles-Edouard [7 ]
Touma, Samir [4 ,5 ,6 ]
El-Khoury, Jonathan [4 ,5 ,6 ]
Keane, Pearse A. [1 ,2 ,8 ]
Duval, Renaud [4 ,6 ]
Affiliations
[1] Moorfields Eye Hosp NHS Fdn Trust, London, England
[2] UCL, Inst Ophthalmol, London, England
[3] CHUM Sch Artificial Intelligence Healthcare, Montreal, PQ, Canada
[4] Univ Montreal, Dept Ophthalmol, Montreal, PQ, Canada
[5] Ctr Hosp Univ Montreal CHUM, Dept Ophthalmol, Montreal, PQ, Canada
[6] Hop Maison Neuve Rosemont, Dept Ophthalmol, Montreal, PQ, Canada
[7] Inst Univ Sante Mentale Montreal IUSMM, Montreal, PQ, Canada
[8] NIHR Moorfields Biomed Res Ctr, London, England
Keywords
Medical Education
DOI
10.1136/bjo-2023-324438
Chinese Library Classification (CLC) code
R77 [Ophthalmology]
Subject classification code
100212
Abstract
Background: Evidence on the performance of Generative Pre-trained Transformer 4 (GPT-4), a large language model (LLM), in the ophthalmology question-answering domain is needed.

Methods: We tested GPT-4 on two 260-question multiple-choice question sets drawn from the Basic and Clinical Science Course (BCSC) Self-Assessment Program and the OphthoQuestions question banks. We compared the accuracy of GPT-4 models with varying temperatures (a creativity setting) and evaluated their responses in a subset of questions. We also compared the best-performing GPT-4 model with GPT-3.5 and with historical human performance.

Results: GPT-4-0.3 (GPT-4 with a temperature of 0.3) achieved the highest accuracy among the GPT-4 models, with 75.8% on the BCSC set and 70.0% on the OphthoQuestions set. The combined accuracy was 72.9%, an 18.3 percentage-point improvement over GPT-3.5 (p<0.001). Human graders preferred responses from models with a temperature higher than 0 (more creative). Exam section, question difficulty and cognitive level were all predictive of GPT-4-0.3 answer accuracy. GPT-4-0.3's performance was numerically superior to human performance on the BCSC (75.8% vs 73.3%) and OphthoQuestions (70.0% vs 63.0%) sets, but the differences were not statistically significant (p=0.55 and p=0.09, respectively).

Conclusion: GPT-4, an LLM not trained specifically on ophthalmology data, performs significantly better than its predecessor on simulated ophthalmology board-style exams. Remarkably, its performance tended to be superior to historical human performance, but that difference was not statistically significant in our study.
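For illustration, the sketch below shows how a question set like those described above might be posed to GPT-4 at a fixed temperature and scored for accuracy. It is a minimal sketch, not the authors' code: it assumes the OpenAI Python client (v1 interface), and the model identifier, prompt wording, sample question and answer-parsing rule are all illustrative stand-ins rather than the paper's actual protocol.

# Minimal sketch (not the authors' code): posing one board-style
# multiple-choice question to GPT-4 at a fixed temperature and
# tallying accuracy over a question set. Assumes the OpenAI Python
# client (v1 interface); model name, prompt, sample item and answer
# parsing are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical stand-in for a BCSC/OphthoQuestions item.
questions = [
    {
        "stem": "Which retinal layer contains the photoreceptor nuclei?",
        "choices": {"A": "Ganglion cell layer",
                    "B": "Outer nuclear layer",
                    "C": "Inner plexiform layer",
                    "D": "Nerve fibre layer"},
        "answer": "B",
    },
]

def ask(question: dict, temperature: float = 0.3) -> str:
    """Pose one MCQ and return the single letter the model picks."""
    options = "\n".join(f"{k}. {v}" for k, v in question["choices"].items())
    prompt = (f"{question['stem']}\n{options}\n"
              "Answer with the single letter of the best choice.")
    resp = client.chat.completions.create(
        model="gpt-4",            # assumed model identifier
        temperature=temperature,  # 0 = deterministic; higher = more creative
        messages=[{"role": "user", "content": prompt}],
    )
    reply = resp.choices[0].message.content.strip()
    return reply[:1].upper()  # crude parse: take the leading letter

correct = sum(ask(q) == q["answer"] for q in questions)
print(f"Accuracy: {correct / len(questions):.1%}")

In the paper's terms, the default temperature of 0.3 corresponds to the best-performing GPT-4-0.3 configuration; a full evaluation would iterate over the two 260-question sets rather than the single stand-in item shown here.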
Pages: 1371-1378
Number of pages: 8