Evaluating the Performance of Different Large Language Models on Health Consultation and Patient Education in Urolithiasis

Cited by: 31
Authors
Song, Haifeng [1 ,2 ]
Xia, Yi [3 ,4 ]
Luo, Zhichao [1 ,2 ]
Liu, Hui [1 ,2 ]
Song, Yan [5 ]
Zeng, Xue [1 ,2 ]
Li, Tianjie [1 ,2 ]
Zhong, Guangxin [1 ,2 ]
Li, Jianxing [1 ,2 ]
Chen, Ming [3 ]
Zhang, Guangyuan [3 ]
Xiao, Bo [1 ,2 ]
Affiliations
[1] Tsinghua Univ, Beijing Tsinghua Changgung Hosp, Sch Clin Med, Dept Urol, 168 Litang Rd, Beijing 102218, Peoples R China
[2] Tsinghua Univ, Inst Urol, Sch Clin Med, Beijing 102218, Peoples R China
[3] Southeast Univ, Zhongda Hosp, Dept Urol, 87 Dingjiaqiao, Nanjing 210009, Peoples R China
[4] Southeast Univ, Sch Med, Nanjing 210009, Peoples R China
[5] China Med Univ, Urol Dept, Sheng Jing Hosp, Shenyang 110000, Peoples R China
Keywords
Urolithiasis; Health consultation; Large language model; ChatGPT; Artificial intelligence;
DOI
10.1007/s10916-023-02021-3
CLC Number
R19 [Health Organization and Administration (Health Services Management)];
Abstract
Objectives: To evaluate the effectiveness of four large language models (LLMs) with large user bases and significant public attention (Claude, Bard, ChatGPT4, and New Bing) in the context of medical consultation and patient education in urolithiasis.

Materials and methods: In this study, we developed a questionnaire consisting of 21 questions and 2 clinical scenarios related to urolithiasis. Clinical consultations were then simulated with each of the four models to assess their responses to the questions. Urolithiasis experts evaluated the model responses for accuracy, comprehensiveness, ease of understanding, human care, and clinical case analysis ability using a predesigned 5-point Likert scale. Visualization and statistical analyses were then employed to compare the four models and evaluate their performance.

Results: All models performed satisfactorily, except that Bard failed to provide a valid response to Question 13. Claude consistently scored highest in all dimensions compared with the other three models. ChatGPT4 ranked second in accuracy, with relatively stable output across multiple tests, but showed shortcomings in empathy and human caring. Bard exhibited the lowest accuracy and overall performance. Claude and ChatGPT4 both demonstrated a high capacity to analyze clinical cases of urolithiasis. Overall, Claude emerged as the best performer in urolithiasis consultation and education.

Conclusion: Claude demonstrated superior performance compared with the other three models in urolithiasis consultation and education. This study highlights the remarkable potential of LLMs in medical health consultation and patient education, although professional review, further evaluation, and modifications are still required.
Pages: 9