Comparing generative and retrieval-based chatbots in answering patient questions regarding age-related macular degeneration and diabetic retinopathy

Cited by: 10
Authors
Cheong, Kai Xiong [1 ]
Zhang, Chenxi [2 ,3 ]
Tan, Tien-En [1 ]
Fenner, Beau J. [1 ,4 ]
Wong, Wendy Meihua [5 ,6 ,7 ]
Teo, Kelvin Yc [1 ,4 ]
Wang, Ya Xing [8 ]
Sivaprasad, Sobha [9 ]
Keane, Pearse A. [10 ]
Lee, Cecilia Sungmin [11 ]
Lee, Aaron Y. [11 ]
Cheung, Chui Ming Gemmy [1 ,4 ]
Wong, Tien Yin [12 ,13 ]
Cheong, Yun-Gyung [14 ]
Song, Su Jeong [15 ]
Tham, Yih Chung [1 ,4 ,6 ,7 ]
Affiliations
[1] Singapore Natl Eye Ctr, Singapore Eye Res Inst, Singapore, Singapore
[2] Chinese Acad Med Sci, Beijing, Peoples R China
[3] Peking Union Med Coll Hosp, Beijing, Peoples R China
[4] Duke NUS Med Sch, Ophthalmol & Visual Sci Acad Clin Program Eye ACP, Singapore, Singapore
[5] Natl Univ Singapore Hosp, Dept Ophthalmol, Singapore, Singapore
[6] Natl Univ Singapore, Ctr Innovat & Precis Eye Hlth, Yong Loo Lin Sch Med, Singapore, Singapore
[7] Natl Univ Singapore, Dept Ophthalmol, Yong Loo Lin Sch Med, Singapore, Singapore
[8] Capital Univ Med Sci, Beijing Tongren Hosp, Beijing Inst Ophthalmol, Beijing, Peoples R China
[9] Moorfields Eye Hosp NHS Fdn Trust, London, England
[10] Moorfields Eye Hosp NHS Fdn Trust, Med Retina, London, England
[11] Univ Washington, Dept Ophthalmol, Seattle, WA 98195 USA
[12] Tsinghua Univ, Tsinghua Med, Beijing, Peoples R China
[13] Beijing Tsinghua Changgung Hosp, Sch Clin Med, Beijing, Peoples R China
[14] Sungkyunkwan Univ, Seoul, South Korea
[15] Kangbuk Samsung Hosp, Seoul, South Korea
Keywords
Macula; Public health; Retina
DOI
10.1136/bjo-2023-324533
Chinese Library Classification (CLC)
R77 [Ophthalmology]
Subject Classification Code
100212
Abstract
Background/aims To compare the performance of generative versus retrieval-based chatbots in answering patient inquiries regarding age-related macular degeneration (AMD) and diabetic retinopathy (DR).
Methods We evaluated four chatbots in a cross-sectional study: three generative models (ChatGPT-4, ChatGPT-3.5 and Google Bard) and a retrieval-based model (OcularBERT). Their response accuracy to 45 questions (15 AMD, 15 DR and 15 others) was evaluated and compared. Three masked retinal specialists graded the responses on a three-point Likert scale: 2 (good, error-free), 1 (borderline) or 0 (poor, with significant inaccuracies). The three grades were summed into an aggregate score ranging from 0 to 6. Based on majority consensus among the graders, the responses were also classified as 'Good', 'Borderline' or 'Poor' quality.
Results Overall, ChatGPT-4 and ChatGPT-3.5 outperformed the other chatbots, both achieving median scores (IQR) of 6 (1), compared with 4.5 (2) for Google Bard and 2 (1) for OcularBERT (all p=8.4×10⁻³). Based on the consensus approach, 83.3% of ChatGPT-4's responses and 86.7% of ChatGPT-3.5's were rated 'Good', surpassing Google Bard (50%) and OcularBERT (10%) (all p=1.4×10⁻²). ChatGPT-4 and ChatGPT-3.5 had no 'Poor'-rated responses; Google Bard produced 6.7% and OcularBERT 20%. Across question types, ChatGPT-4 outperformed Google Bard only for AMD, whereas ChatGPT-3.5 outperformed Google Bard for DR and other questions.
Conclusion ChatGPT-4 and ChatGPT-3.5 demonstrated superior performance, followed by Google Bard and OcularBERT. Generative chatbots are potentially capable of answering domain-specific questions outside their original training. Further validation studies are still required prior to real-world implementation.
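The study does not publish analysis code; the following minimal Python sketch only illustrates the grading scheme described in the Methods: summing three Likert grades into a 0-6 aggregate score, and deriving the majority-consensus quality label. The function names (aggregate_score, consensus_label) are hypothetical, and the tie-break for a three-way grader split is an assumption, since the abstract does not specify one.

```python
from collections import Counter

# Likert scale used by the three masked graders, per the abstract:
# 2 = good (error-free), 1 = borderline, 0 = poor (significant inaccuracies).
LABELS = {2: "Good", 1: "Borderline", 0: "Poor"}


def aggregate_score(grades):
    """Sum the three graders' scores; the aggregate ranges from 0 to 6."""
    if len(grades) != 3 or any(g not in LABELS for g in grades):
        raise ValueError("expected three grades, each 0, 1 or 2")
    return sum(grades)


def consensus_label(grades):
    """Majority-consensus quality label among the three graders.

    If at least two graders assign the same score, the response takes
    that score's label. The abstract does not say how a three-way split
    (e.g. grades 0, 1, 2) was resolved; falling back to 'Borderline'
    here is purely an assumption for illustration.
    """
    score, count = Counter(grades).most_common(1)[0]
    return LABELS[score] if count >= 2 else "Borderline"


# Hypothetical grades for one chatbot response.
grades = [2, 2, 1]
print(aggregate_score(grades))   # -> 5
print(consensus_label(grades))   # -> "Good"
```

The pairwise comparisons behind the reported p values would typically use a non-parametric test on these ordinal aggregate scores, but the abstract does not name the test used, so none is shown here.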
Pages: 1443-1449
Page count: 7