Assessing the proficiency of large language models on funduscopic disease knowledge

Cited: 0
Authors
Wu, Jun-Yi [1 ]
Zeng, Yan-Mei [2 ]
Qian, Xian-Zhe [2 ]
Hong, Qi [2 ]
Hu, Jin-Yu [2 ]
Wei, Hong [2 ]
Zou, Jie [2 ]
Chen, Cheng [2 ]
Wang, Xiao-Yu [2 ]
Chen, Xu [3 ]
Shao, Yi [4 ]
Affiliations
[1] Wuhan Fourth Hosp, Dept Ophthalmol, Wuhan 430033, Hubei, Peoples R China
[2] Nanchang Univ, Affiliated Hosp 1, Jiangxi Med Coll, Dept Ophthalmol, Nanchang 330006, Jiangxi, Peoples R China
[3] Maastricht Univ, Ophthalmol Ctr, NL-6200 MS Maastricht, Limburg, Netherlands
[4] Shanghai Jiao Tong Univ, Shanghai Gen Hosp, Natl Clin Res Ctr Eye Dis, Dept Ophthalmol,Sch Med, Shanghai 200080, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
large language models; ChatGPT; funduscopic disease;
DOI
10.18240/ijo.2025.07.03
Chinese Library Classification
R77 [Ophthalmology];
Discipline code
100212;
Abstract
AIM: To assess the performance of five large language models (LLMs; ChatGPT-3.5, ChatGPT-4, PaLM2, Claude 2, and SenseNova) against two human cohorts (a group of funduscopic disease experts and a group of ophthalmologists) on the specialized subject of funduscopic disease.
METHODS: The five LLMs and the two human groups independently completed a 100-item funduscopic disease test. Their performance was assessed by comparing average scores, response stability, and answer confidence, thereby establishing a basis for evaluation.
RESULTS: Among the LLMs, ChatGPT-4 and PaLM2 exhibited the strongest average correlation with each other. ChatGPT-4 also achieved the highest average score and demonstrated the greatest confidence during the exam. Compared with the human cohorts, ChatGPT-4 performed on par with the ophthalmologists but fell short of the expertise demonstrated by the funduscopic disease specialists.
CONCLUSION: The study provides evidence of the exceptional performance of ChatGPT-4 in the domain of funduscopic disease. With continued enhancement, validated LLMs have the potential to yield unforeseen advantages in improving healthcare for both patients and physicians.
Pages: 1205-1213 (9 pages)