Benchmarking large language models' performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard

Cited by: 192
Authors
Lim, Zhi Wei [1]
Pushpanathan, Krithi [1, 2, 3]
Yew, Samantha Min Er [1, 2, 3]
Lai, Yien [1, 2, 3, 4]
Sun, Chen-Hsin [1, 2, 3, 4]
Lam, Janice Sing Harn [1, 2, 3, 4]
Chen, David Ziyou [1, 2, 3, 4]
Goh, Jocelyn Hui Lin [5]
Tan, Marcus Chun Jin [1, 2, 3, 4]
Sheng, Bin [6, 7, 8]
Cheng, Ching-Yu [1, 2, 3, 5, 9]
Koh, Victor Teck Chang [1, 2, 3, 4]
Tham, Yih-Chung [1, 2, 3, 5, 9]
Affiliations
[1] Natl Univ Singapore, Yong Loo Lin Sch Med, Level 13,MD1 Tahir Fdn Bldg,12 Sci Dr 2, Singapore 117549, Singapore
[2] Natl Univ Singapore, Ctr Innovat & Precis Eye Hlth, Yong Loo Lin Sch Med, Dept Ophthalmol, Singapore, Singapore
[3] Natl Univ Hlth Syst, Singapore, Singapore
[4] Natl Univ Singapore Hosp, Dept Ophthalmol, Singapore, Singapore
[5] Singapore Natl Eye Ctr, Singapore Eye Res Inst, Singapore, Singapore
[6] Shanghai Jiao Tong Univ, Dept Comp Sci & Engn, Shanghai, Peoples R China
[7] Shanghai Jiao Tong Univ Affiliated Peoples Hosp 6, Shanghai Diabet Inst, Shanghai Clin Ctr Diabet, Dept Endocrinol & Metab, Shanghai, Peoples R China
[8] Shanghai Jiao Tong Univ, Artificial Intelligence Inst, MoE Key Lab Artificial Intelligence, Shanghai, Peoples R China
[9] Duke NUS Med Sch, Eye Acad Clin Program Eye ACP, Singapore, Singapore
Funding
UK Medical Research Council
Keywords
ChatGPT-4.0; ChatGPT-3.5; Google Bard; Chatbot; Myopia; Large language models; PREVENTION; ATROPINE
DOI
10.1016/j.ebiom.2023.104770
Chinese Library Classification
R5 [Internal Medicine]
Discipline classification codes
1002; 100201
Abstract
Background: Large language models (LLMs) are garnering wide interest due to their human-like and contextually relevant responses. However, LLMs' accuracy across specific medical domains has yet to be thoroughly evaluated. Myopia is a frequent topic on which patients and parents commonly seek information online. Our study evaluated the performance of three LLMs, namely ChatGPT-3.5, ChatGPT-4.0, and Google Bard, in delivering accurate responses to common myopia-related queries.

Methods: We curated thirty-one commonly asked myopia care-related questions, which were categorised into six domains: pathogenesis, risk factors, clinical presentation, diagnosis, treatment and prevention, and prognosis. Each question was posed to the LLMs, and their responses were independently graded by three consultant-level paediatric ophthalmologists on a three-point accuracy scale (poor, borderline, good). A majority consensus approach was used to determine the final rating for each response. Responses rated 'good' were further evaluated for comprehensiveness on a five-point scale. Responses rated 'poor' were instead prompted for self-correction and then re-evaluated for accuracy.

Findings: ChatGPT-4.0 demonstrated superior accuracy, with 80.6% of responses rated as 'good', compared to 61.3% in ChatGPT-3.5 and 54.8% in Google Bard (Pearson's chi-squared test, all p ≤ 0.009). All three LLM-Chatbots showed high mean comprehensiveness scores (Google Bard: 4.35; ChatGPT-4.0: 4.23; ChatGPT-3.5: 4.11, out of a maximum score of 5). All LLM-Chatbots also demonstrated substantial self-correction capabilities: 66.7% (2 in 3) of ChatGPT-4.0's, 40% (2 in 5) of ChatGPT-3.5's, and 60% (3 in 5) of Google Bard's responses improved after self-correction. The LLM-Chatbots performed consistently across domains, except for 'treatment and prevention'. Even in this domain, ChatGPT-4.0 still performed best, receiving 70% 'good' ratings, compared to 40% in ChatGPT-3.5 and 45% in Google Bard (Pearson's chi-squared test, all p ≤ 0.001).

Interpretation: Our findings underscore the potential of LLMs, particularly ChatGPT-4.0, for delivering accurate and comprehensive responses to myopia-related queries. Continuous strategies and evaluations to improve LLMs' accuracy remain crucial.
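The majority consensus step described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `consensus_rating`, the example grades, and the choice to return `None` when no grade has a strict majority are all assumptions made for illustration, since the abstract does not say how a three-way split was resolved.

```python
from collections import Counter
from typing import Optional, Sequence


def consensus_rating(grades: Sequence[str]) -> Optional[str]:
    """Return the grade chosen by a strict majority of graders, else None.

    Assumption: a three-way split (one 'poor', one 'borderline', one 'good')
    has no majority; the abstract does not describe how such cases were handled.
    """
    if not grades:
        return None
    grade, count = Counter(grades).most_common(1)[0]
    return grade if count > len(grades) / 2 else None


# Hypothetical grades from three graders (illustrative only, not study data).
example_grades = {
    "Q1": ["good", "good", "borderline"],
    "Q2": ["poor", "borderline", "poor"],
    "Q3": ["good", "borderline", "poor"],  # no majority among the graders
}

final = {q: consensus_rating(g) for q, g in example_grades.items()}
print(final)
```

In this sketch a response with no majority is flagged rather than forced into a category, so it could be sent for further adjudication.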
Pages: 11