Assessing the medical reasoning skills of GPT-4 in complex ophthalmology cases

Cited by: 15
Authors
Milad, Daniel [1 ,2 ]
Antaki, Fares [1 ,3 ,4 ]
Milad, Jason [5 ]
Farah, Andrew [6 ]
Khairy, Thomas [6 ]
Mikhail, David [7 ]
Giguere, Charles-Edouard [8 ]
Touma, Samir [1 ,2 ]
Bernstein, Allison [1 ,2 ]
Szigiato, Andrei-Alexandru [1 ,9 ]
Nayman, Taylor [1 ,2 ]
Mullie, Guillaume A. [1 ,10 ]
Duval, Renaud [1 ,2 ]
Affiliations
[1] Univ Montreal, Dept Ophthalmol, Montreal, PQ, Canada
[2] Hop Maisonneuve Rosemont, Dept Ophthalmol, Montreal, PQ, Canada
[3] UCL, Inst Ophthalmol, London, England
[4] Ctr Hosp Univ Montreal CHUM, CHUM Sch Artificial Intelligence Healthcare SAIH, Montreal, PQ, Canada
[5] Univ Waterloo, Dept Software Engn, Waterloo, ON, Canada
[6] McGill Univ, Fac Med, Montreal, PQ, Canada
[7] Univ Toronto, Fac Med, Toronto, ON, Canada
[8] Inst Univ Sante Mentale Montreal, Ctr Rech, Montreal, PQ, Canada
[9] Hop Sacre Coeur Montreal, Dept Ophthalmol, Montreal, PQ, Canada
[10] Cite La Sante Hosp, Dept Ophthalmol, Laval, PQ, Canada
Keywords
DOI
10.1136/bjo-2023-325053
Chinese Library Classification (CLC) number
R77 [Ophthalmology];
Subject classification number
100212;
Abstract
Background/aims This study assesses the proficiency of Generative Pre-trained Transformer (GPT)-4 in answering questions about complex clinical ophthalmology cases.
Methods We tested GPT-4 on 422 Journal of the American Medical Association (JAMA) Ophthalmology Clinical Challenges, prompting the model to determine the diagnosis (open-ended question) and to identify the next step (multiple-choice question). We generated responses using two zero-shot prompting strategies, including zero-shot plan-and-solve+ (PS+), to improve the model's reasoning. We compared the best-performing model to human graders in a benchmarking effort.
Results Using PS+ prompting, GPT-4 achieved mean accuracies of 48.0% (95% CI 43.1% to 52.9%) for diagnosis and 63.0% (95% CI 58.2% to 67.6%) for next step. Next-step accuracy did not differ significantly by subspecialty (p=0.44), but diagnostic accuracy in pathology and tumours was significantly higher than in uveitis (p=0.027). When the diagnosis was accurate, 75.2% (95% CI 68.6% to 80.9%) of next steps were correct; when the diagnosis was incorrect, 50.2% (95% CI 43.8% to 56.6%) were. The next step was three times more likely to be accurate when the initial diagnosis was correct (p<0.001). No significant differences were observed between board-certified ophthalmologists and GPT-4 in diagnostic accuracy or decision-making. Among trainees, senior residents outperformed GPT-4 in both diagnostic accuracy (p=0.001 and p=0.049) and next-step accuracy (p=0.002 and p=0.020).
Conclusion Improved prompting enhances GPT-4's performance in complex clinical situations, although the model does not surpass ophthalmology trainees in our context. Specialised large language models hold promise for future assistance in medical decision-making and diagnosis.
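The abstract compares a plain zero-shot prompt against zero-shot plan-and-solve+ (PS+) prompting. The sketch below shows what such a prompt wrapper might look like; the trigger phrases, the `build_prompt` helper and the example vignette are illustrative assumptions based on the published PS+ template, not the authors' exact prompts.

```python
def build_prompt(case_text: str, strategy: str = "ps_plus") -> str:
    """Wrap a clinical vignette in a zero-shot prompt (hypothetical sketch)."""
    if strategy == "zero_shot":
        # Plain zero-shot baseline: no explicit reasoning trigger.
        trigger = "Let's think about this."
    elif strategy == "ps_plus":
        # PS+ asks the model to first understand and plan, then execute
        # the plan step by step (following the published PS+ template).
        trigger = (
            "Let's first understand the problem, extract relevant variables, "
            "and devise a complete plan. Then, let's carry out the plan, "
            "solve the problem step by step, and show the answer."
        )
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return (
        f"{case_text}\n\n"
        "Question: What is the most likely diagnosis?\n"
        f"Answer: {trigger}"
    )

prompt = build_prompt(
    "A 67-year-old presents with sudden painless vision loss in the left eye."
)
```

Either prompt variant would then be sent unchanged to the model API; only the trailing trigger differs between the two strategies being compared.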
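The reported "three times more likely" figure can be sanity-checked from the two conditional accuracies in the abstract, assuming the factor refers to an odds ratio (as would come from a logistic regression; the paper's exact model is not stated in this record).

```python
# Next-step accuracy given a correct vs incorrect initial diagnosis,
# as reported in the abstract.
p_correct, p_incorrect = 0.752, 0.502

def odds(p: float) -> float:
    """Convert a probability to odds."""
    return p / (1 - p)

# Odds ratio: how much the odds of a correct next step increase
# when the initial diagnosis is correct.
odds_ratio = odds(p_correct) / odds(p_incorrect)
print(round(odds_ratio, 2))  # ≈ 3.0, consistent with "three times more likely"
```

The point estimate works out to roughly 3.0, which matches the stated factor under the odds-ratio reading.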
Pages: 1398-1405
Number of pages: 8
References
27 items
[1]  
2023, arXiv, DOI: 10.48550/arXiv.2303.08774
[2]   Capabilities of GPT-4 in ophthalmology: an analysis of model entropy and progress towards human-level medical question answering [J].
Antaki, Fares ;
Milad, Daniel ;
Chia, Mark A. ;
Giguere, Charles-Edouard ;
Touma, Samir ;
El-Khoury, Jonathan ;
Keane, Pearse A. ;
Duval, Renaud .
BRITISH JOURNAL OF OPHTHALMOLOGY, 2024, 108 (10) :1371-1378
[3]   Evaluating the Performance of ChatGPT in Ophthalmology [J].
Antaki, Fares ;
Touma, Samir ;
Milad, Daniel ;
El-Khoury, Jonathan ;
Duval, Renaud .
OPHTHALMOLOGY SCIENCE, 2023, 3 (04)
[4]   Large language models and their impact in ophthalmology [J].
Betzler, Bjorn Kaijun ;
Chen, Haichao ;
Cheng, Ching-Yu ;
Lee, Cecilia S. ;
Ning, Guochen ;
Song, Su Jeong ;
Lee, Aaron Y. ;
Kawasaki, Ryo ;
van Wijngaarden, Peter ;
Grzybowski, Andrzej ;
He, Mingguang ;
Li, Dawei ;
Ran, An Ran ;
Ting, Daniel Shu Wei ;
Teo, Kelvin ;
Ruamviboonsuk, Paisan ;
Sivaprasad, Sobha ;
Chaudhary, Varun ;
Tadayoni, Ramin ;
Wang, Xiaofei ;
Cheung, Carol Y. ;
Zheng, Yingfeng ;
Wang, Ya Xing ;
Tham, Yih Chung ;
Wong, Tien Yin .
LANCET DIGITAL HEALTH, 2023, 5 (12) :E917-E924
[5]  
Brown TB, 2020, ADV NEUR IN, V33
[6]  
Buckley T, 2024, arXiv, DOI: 10.48550/arXiv.2311.05591
[7]   Performance of Generative Large Language Models on Ophthalmology Board-Style Questions [J].
Cai, Louis Z. ;
Shaheen, Abdulla ;
Jin, Andrew ;
Fukui, Riya ;
Yi, Jonathan S. ;
Yannuzzi, Nicolas ;
Alabiad, Chrisfouad .
AMERICAN JOURNAL OF OPHTHALMOLOGY, 2023, 254 :141-149
[8]  
Delsoz M, 2023, medRxiv, DOI: 10.1101/2023.08.25.23294635
[9]   The Use of ChatGPT to Assist in Diagnosing Glaucoma Based on Clinical Case Reports [J].
Delsoz, Mohammad ;
Raja, Hina ;
Madadi, Yeganeh ;
Tang, Anthony A. ;
Wirostko, Barbara M. ;
Kahook, Malik Y. ;
Yousefi, Siamak .
OPHTHALMOLOGY AND THERAPY, 2023, 12 (06) :3121-3132
[10]  
Eriksen AV., NEJM AI, 2023