Comparing generative and retrieval-based chatbots in answering patient questions regarding age-related macular degeneration and diabetic retinopathy

Cited by: 10
Authors
Cheong, Kai Xiong [1 ]
Zhang, Chenxi [2 ,3 ]
Tan, Tien-En [1 ]
Fenner, Beau J. [1 ,4 ]
Wong, Wendy Meihua [5 ,6 ,7 ]
Teo, Kelvin YC [1 ,4 ]
Wang, Ya Xing [8 ]
Sivaprasad, Sobha [9 ]
Keane, Pearse A. [10 ]
Lee, Cecilia Sungmin [11 ]
Lee, Aaron Y. [11 ]
Cheung, Chui Ming Gemmy [1 ,4 ]
Wong, Tien Yin [12 ,13 ]
Cheong, Yun-Gyung [14 ]
Song, Su Jeong [15 ]
Tham, Yih Chung [1 ,4 ,6 ,7 ]
Affiliations
[1] Singapore Natl Eye Ctr, Singapore Eye Res Inst, Singapore, Singapore
[2] Chinese Acad Med Sci, Beijing, Peoples R China
[3] Peking Union Med Coll Hosp, Beijing, Peoples R China
[4] Duke NUS Med Sch, Ophthalmol & Visual Sci Acad Clin Program Eye ACP, Singapore, Singapore
[5] Natl Univ Singapore Hosp, Dept Ophthalmol, Singapore, Singapore
[6] Natl Univ Singapore, Ctr Innovat & Precis Eye Hlth, Yong Loo Lin Sch Med, Singapore, Singapore
[7] Natl Univ Singapore, Dept Ophthalmol, Yong Loo Lin Sch Med, Singapore, Singapore
[8] Capital Univ Med Sci, Beijing Tongren Hosp, Beijing Inst Ophthalmol, Beijing, Peoples R China
[9] Moorfields Eye Hosp NHS Fdn Trust, London, England
[10] Moorfields Eye Hosp NHS Fdn Trust, Med Retina, London, England
[11] Univ Washington, Dept Ophthalmol, Seattle, WA 98195 USA
[12] Tsinghua Univ, Tsinghua Med, Beijing, Peoples R China
[13] Beijing Tsinghua Changgung Hosp, Sch Clin Med, Beijing, Peoples R China
[14] Sungkyunkwan Univ, Seoul, South Korea
[15] Kangbuk Samsung Hosp, Seoul, South Korea
Keywords
Macula; Public health; Retina
DOI
10.1136/bjo-2023-324533
CLC classification
R77 [Ophthalmology]
Subject classification
100212
Abstract
Background/aims: To compare the performance of generative versus retrieval-based chatbots in answering patient inquiries regarding age-related macular degeneration (AMD) and diabetic retinopathy (DR).

Methods: We evaluated four chatbots in a cross-sectional study: three generative models (ChatGPT-4, ChatGPT-3.5 and Google Bard) and one retrieval-based model (OcularBERT). Their accuracy in responding to 45 questions (15 AMD, 15 DR and 15 others) was evaluated and compared. Three masked retinal specialists graded the responses on a three-point Likert scale: 2 (good, error-free), 1 (borderline) or 0 (poor, with significant inaccuracies). The three scores were summed, yielding an aggregate score ranging from 0 to 6. Based on majority consensus among the graders, responses were also classified as 'Good', 'Borderline' or 'Poor' quality.

Results: Overall, ChatGPT-4 and ChatGPT-3.5 outperformed the other chatbots, both achieving median (IQR) scores of 6 (1), compared with 4.5 (2) for Google Bard and 2 (1) for OcularBERT (all p=8.4×10⁻³). Under the consensus approach, 83.3% of ChatGPT-4's responses and 86.7% of ChatGPT-3.5's were rated 'Good', surpassing Google Bard (50%) and OcularBERT (10%) (all p=1.4×10⁻²). ChatGPT-4 and ChatGPT-3.5 had no responses rated 'Poor', whereas Google Bard and OcularBERT produced 6.7% and 20% 'Poor' responses, respectively. Across question types, ChatGPT-4 outperformed Google Bard only for AMD questions, while ChatGPT-3.5 outperformed Google Bard for DR and other questions.

Conclusion: ChatGPT-4 and ChatGPT-3.5 demonstrated superior performance, followed by Google Bard and OcularBERT. Generative chatbots are potentially capable of answering domain-specific questions outside their original training. Further validation studies are still required before real-world implementation.
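The grading scheme described in the Methods (three masked graders each assigning 0, 1 or 2; scores summed to a 0–6 aggregate; a quality label assigned by majority consensus) can be sketched as follows. This is an illustrative reconstruction, not code from the study; the function names are hypothetical.

```python
from collections import Counter

# Per-grader Likert labels used in the study: 2 = good, 1 = borderline, 0 = poor.
LABELS = {2: "Good", 1: "Borderline", 0: "Poor"}

def aggregate_score(grades):
    """Sum three graders' scores (each 0-2) into a 0-6 aggregate."""
    assert len(grades) == 3 and all(g in (0, 1, 2) for g in grades)
    return sum(grades)

def consensus_label(grades):
    """Quality label by majority consensus among the three graders.

    Returns None when all three graders disagree (no majority); how such
    ties were resolved is not specified in the abstract.
    """
    value, count = Counter(grades).most_common(1)[0]
    return LABELS[value] if count >= 2 else None

# Example: two graders rate a response 'good', one 'borderline'.
grades = [2, 2, 1]
print(aggregate_score(grades))   # 5
print(consensus_label(grades))   # Good
```

Note that the aggregate score and the consensus label are computed independently: a response scoring 5 of 6 is still labelled 'Good' as long as two of the three graders assigned a 2.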
Pages: 1443-1449
Page count: 7