Comparative performance analysis of large language models: ChatGPT-3.5, ChatGPT-4 and Google Gemini in glucocorticoid-induced osteoporosis

Cited: 0
Authors
Tong, Linjian [1 ]
Zhang, Chaoyang [2 ]
Liu, Rui [1 ]
Yang, Jia [1 ]
Sun, Zhiming [1 ]
Affiliations
[1] Tianjin Med Univ, Clin Coll Neurol Neurosurg & Neurorehabil, Tianjin 300070, Peoples R China
[2] Tianjin Med Univ, Baodi Hosp, Dept Orthoped, Tianjin 301800, Peoples R China
Source
JOURNAL OF ORTHOPAEDIC SURGERY AND RESEARCH | 2024, Vol. 19, Issue 01
Keywords
Large language models; AI; ChatGPT; Google Gemini; Glucocorticoid-Induced osteoporosis;
DOI
10.1186/s13018-024-04996-2
Chinese Library Classification
R826.8 [Plastic Surgery]; R782.2 [Oral and Maxillofacial Plastic Surgery]; R726.2 [Pediatric Plastic Surgery]; R62 [Plastic Surgery (Reconstructive Surgery)];
Subject Classification Code
Abstract
Background: The use of large language models (LLMs) in medicine can help physicians improve the quality and effectiveness of health care by increasing the efficiency of medical information management, patient care, medical research, and clinical decision-making.
Methods: We collected 34 frequently asked questions about glucocorticoid-induced osteoporosis (GIOP), covering the disease's clinical manifestations, pathogenesis, diagnosis, treatment, prevention, and risk factors. We also generated 25 questions based on the 2022 American College of Rheumatology Guideline for the Prevention and Treatment of Glucocorticoid-Induced Osteoporosis (2022 ACR-GIOP Guideline). Each question was posed to each LLM (ChatGPT-3.5, ChatGPT-4, and Google Gemini), and three senior orthopedic surgeons independently rated each response on a scale of 1 to 4 points. A total score (TS) > 9 indicated a 'good' response, 6 <= TS <= 9 a 'moderate' response, and TS < 6 a 'poor' response.
Results: In response to the general questions about GIOP and the 2022 ACR-GIOP Guideline, Google Gemini provided more concise answers than the other LLMs. For questions on pathogenesis, ChatGPT-4 achieved significantly higher total scores (TSs) than ChatGPT-3.5. ChatGPT-4's TSs for questions related to the 2022 ACR-GIOP Guideline were significantly higher than Google Gemini's. ChatGPT-3.5 and ChatGPT-4 achieved significantly higher TSs after self-correction than before, whereas Google Gemini's self-corrected responses did not differ significantly from its original ones.
Conclusions: Our study showed that Google Gemini provides more concise and intuitive responses than ChatGPT-3.5 and ChatGPT-4. ChatGPT-4 performed significantly better than ChatGPT-3.5 and Google Gemini in answering general questions about GIOP and the 2022 ACR-GIOP Guideline. ChatGPT-3.5 and ChatGPT-4 self-corrected better than Google Gemini.
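The scoring scheme described in the abstract can be sketched in a few lines: three raters each assign 1 to 4 points, so the total score (TS) ranges from 3 to 12 and is binned into the study's three quality bands. This is a minimal illustration with hypothetical helper names, not code from the study itself.

```python
def total_score(ratings):
    """Sum three independent rater scores, each between 1 and 4 points."""
    assert len(ratings) == 3 and all(1 <= r <= 4 for r in ratings)
    return sum(ratings)

def categorize(ts):
    """Map a total score (3-12) to the quality band used in the study."""
    if ts > 9:
        return "good"       # TS > 9
    if ts >= 6:
        return "moderate"   # 6 <= TS <= 9
    return "poor"           # TS < 6

print(categorize(total_score([4, 4, 3])))  # TS = 11 -> good
print(categorize(total_score([3, 2, 3])))  # TS = 8  -> moderate
print(categorize(total_score([1, 2, 2])))  # TS = 5  -> poor
```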
Pages: 11
Related Papers
50 records total
  • [1] Benchmarking the performance of large language models in uveitis: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, Google Gemini, and Anthropic Claude3
    Zhao, Fang-Fang
    He, Han-Jie
    Liang, Jia-Jian
    Cen, Jingyun
    Wang, Yun
    Lin, Hongjie
    Chen, Feifei
    Li, Tai-Ping
    Yang, Jian-Feng
    Chen, Lan
    Cen, Ling-Ping
    EYE, 2024,
  • [2] Comment on: "Benchmarking the performance of large language models in uveitis: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, Google Gemini, and Anthropic Claude3"
    Luo, Xiao
    Tang, Cheng
    Chen, Jin-Jin
    Yuan, Jin
    Huang, Jin-Jin
    Yan, Tao
    EYE, 2025, : 1432 - 1432
  • [3] Comparative evaluation of ChatGPT-4, ChatGPT-3.5 and Google Gemini on PCOS assessment and management based on recommendations from the 2023 guideline
    Gunesli, Irmak
    Aksun, Seren
    Fathelbab, Jana
    Yildiz, Bulent Okan
    ENDOCRINE, 2024, : 315 - 322
  • [4] Political Bias in Large Language Models: A Comparative Analysis of ChatGPT-4, Perplexity, Google Gemini, and Claude
    Choudhary, Tavishi
    IEEE ACCESS, 2025, 13 : 11341 - 11379
  • [5] Performance of ChatGPT-3.5 and ChatGPT-4 on the European Board of Urology (EBU) exams: a comparative analysis
    Schoch, Justine
    Schmelz, H.-U.
    Strauch, Angelina
    Borgmann, Hendrik
    Nestler, Tim
    WORLD JOURNAL OF UROLOGY, 2024, 42 (01)
  • [6] Reply to 'Comment on: Benchmarking the performance of large language models in uveitis: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, Google Gemini, and Anthropic Claude3'
    Zhao, Fang-Fang
    He, Han-Jie
    Liang, Jia-Jian
    Cen, Ling-Ping
    EYE, 2025, : 1433 - 1433
  • [7] Benchmarking large language models' performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard
    Lim, Zhi Wei
    Pushpanathan, Krithi
    Yew, Samantha Min Er
    Lai, Yien
    Sun, Chen-Hsin
    Lam, Janice Sing Harn
    Chen, David Ziyou
    Goh, Jocelyn Hui Lin
    Tan, Marcus Chun Jin
    Sheng, Bin
    Cheng, Ching-Yu
    Koh, Victor Teck Chang
    Tham, Yih-Chung
    EBIOMEDICINE, 2023, 95
  • [8] A Performance Evaluation of Large Language Models in Keratoconus: A Comparative Study of ChatGPT-3.5, ChatGPT-4.0, Gemini, Copilot, Chatsonic, and Perplexity
    Reyhan, Ali Hakim
    Mutaf, Cagri
    Uzun, Irfan
    Yuksekyayla, Funda
    JOURNAL OF CLINICAL MEDICINE, 2024, 13 (21)
  • [9] Performance of ChatGPT-3.5, ChatGPT-4, Microsoft Copilot, and Google Bard To Identify Correct Information for Lung Cancer
    Le, Hoa
    Truong, Chi
    PHARMACOEPIDEMIOLOGY AND DRUG SAFETY, 2024, 33 : 347 - 348
  • [10] Performance of ChatGPT-3.5 and ChatGPT-4 in the Taiwan National Pharmacist Licensing Examination: Comparative Evaluation Study
    Wang, Ying-Mei
    Shen, Hung-Wei
    Chen, Tzeng-Ji
    Chiang, Shu-Chiung
    Lin, Ting-Guan
    JMIR MEDICAL EDUCATION, 2025, 11