Comparative performance analysis of large language models: ChatGPT-3.5, ChatGPT-4 and Google Gemini in glucocorticoid-induced osteoporosis

Cited: 0
Authors
Tong, Linjian [1 ]
Zhang, Chaoyang [2 ]
Liu, Rui [1 ]
Yang, Jia [1 ]
Sun, Zhiming [1 ]
Affiliations
[1] Tianjin Med Univ, Clin Coll Neurol Neurosurg & Neurorehabil, Tianjin 300070, Peoples R China
[2] Tianjin Med Univ, Baodi Hosp, Dept Orthoped, Tianjin 301800, Peoples R China
Source
JOURNAL OF ORTHOPAEDIC SURGERY AND RESEARCH | 2024, Vol. 19, Issue 01
Keywords
Large language models; AI; ChatGPT; Google Gemini; Glucocorticoid-Induced osteoporosis;
DOI
10.1186/s13018-024-04996-2
Chinese Library Classification
R826.8 [Plastic Surgery]; R782.2 [Oral and Maxillofacial Plastic Surgery]; R726.2 [Pediatric Plastic Surgery]; R62 [Plastic Surgery (Reconstructive Surgery)];
Subject Classification Code
Abstract
Background: The use of large language models (LLMs) in medicine can help physicians improve the quality and effectiveness of health care by increasing the efficiency of medical information management, patient care, medical research, and clinical decision-making.
Methods: We collected 34 frequently asked questions about glucocorticoid-induced osteoporosis (GIOP), covering the disease's clinical manifestations, pathogenesis, diagnosis, treatment, prevention, and risk factors. We also generated 25 questions based on the 2022 American College of Rheumatology Guideline for the Prevention and Treatment of Glucocorticoid-Induced Osteoporosis (2022 ACR-GIOP Guideline). Each question was posed to each LLM (ChatGPT-3.5, ChatGPT-4, and Google Gemini), and three senior orthopedic surgeons independently rated each response on a scale of 1 to 4 points. A total score (TS) > 9 indicated a 'good' response, 6 <= TS <= 9 a 'moderate' response, and TS < 6 a 'poor' response.
Results: In response to the general questions about GIOP and the 2022 ACR-GIOP Guideline, Google Gemini provided more concise answers than the other LLMs. For questions on pathogenesis, ChatGPT-4 achieved significantly higher total scores (TSs) than ChatGPT-3.5. ChatGPT-4's TSs for questions related to the 2022 ACR-GIOP Guideline were significantly higher than Google Gemini's. ChatGPT-3.5 and ChatGPT-4 achieved significantly higher TSs after self-correction than before, whereas Google Gemini's self-corrected responses did not differ significantly from its original ones.
Conclusions: Our study showed that Google Gemini provides more concise and intuitive responses than ChatGPT-3.5 and ChatGPT-4. ChatGPT-4 performed significantly better than ChatGPT-3.5 and Google Gemini in answering general questions about GIOP and the 2022 ACR-GIOP Guideline. ChatGPT-3.5 and ChatGPT-4 self-corrected better than Google Gemini.
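The scoring scheme described in the abstract can be sketched in a few lines: three raters each assign 1 to 4 points, so the total score (TS) ranges from 3 to 12 and is binned into the study's three quality bands. This is a minimal illustration with hypothetical helper names, not code from the study itself.

```python
def total_score(ratings):
    """Sum three independent rater scores, each between 1 and 4 points."""
    assert len(ratings) == 3 and all(1 <= r <= 4 for r in ratings)
    return sum(ratings)

def categorize(ts):
    """Map a total score (3-12) to the quality band used in the study."""
    if ts > 9:
        return "good"       # TS > 9
    if ts >= 6:
        return "moderate"   # 6 <= TS <= 9
    return "poor"           # TS < 6

print(categorize(total_score([4, 4, 3])))  # TS = 11 -> good
print(categorize(total_score([3, 2, 3])))  # TS = 8  -> moderate
print(categorize(total_score([1, 2, 2])))  # TS = 5  -> poor
```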
Pages: 11
Related Papers
50 records total
  • [1] Benchmarking the performance of large language models in uveitis: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, Google Gemini, and Anthropic Claude3
    Zhao, Fang-Fang
    He, Han-Jie
    Liang, Jia-Jian
    Cen, Jingyun
    Wang, Yun
    Lin, Hongjie
    Chen, Feifei
    Li, Tai-Ping
    Yang, Jian-Feng
    Chen, Lan
    Cen, Ling-Ping
    EYE, 2024,
  • [2] Comment on: "Benchmarking the performance of large language models in uveitis: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, Google Gemini, and Anthropic Claude3"
    Luo, Xiao
    Tang, Cheng
    Chen, Jin-Jin
    Yuan, Jin
    Huang, Jin-Jin
    Yan, Tao
    EYE, 2025, : 1432 - 1432
  • [3] Comparative evaluation of ChatGPT-4, ChatGPT-3.5 and Google Gemini on PCOS assessment and management based on recommendations from the 2023 guideline
    Gunesli, Irmak
    Aksun, Seren
    Fathelbab, Jana
    Yildiz, Bulent Okan
    ENDOCRINE, 2024, : 315 - 322
  • [4] Political Bias in Large Language Models: A Comparative Analysis of ChatGPT-4, Perplexity, Google Gemini, and Claude
    Choudhary, Tavishi
    IEEE ACCESS, 2025, 13 : 11341 - 11379
  • [5] Performance of ChatGPT-3.5 and ChatGPT-4 on the European Board of Urology (EBU) exams: a comparative analysis
    Schoch, Justine
    Schmelz, H.-U.
    Strauch, Angelina
    Borgmann, Hendrik
    Nestler, Tim
    WORLD JOURNAL OF UROLOGY, 2024, 42 (01)
  • [6] Reply to 'Comment on: Benchmarking the performance of large language models in uveitis: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, Google Gemini, and Anthropic Claude3'
    Zhao, Fang-Fang
    He, Han-Jie
    Liang, Jia-Jian
    Cen, Ling-Ping
    EYE, 2025, : 1433 - 1433
  • [7] Benchmarking large language models' performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard
    Lim, Zhi Wei
    Pushpanathan, Krithi
    Yew, Samantha Min Er
    Lai, Yien
    Sun, Chen-Hsin
    Lam, Janice Sing Harn
    Chen, David Ziyou
    Goh, Jocelyn Hui Lin
    Tan, Marcus Chun Jin
    Sheng, Bin
    Cheng, Ching-Yu
    Koh, Victor Teck Chang
    Tham, Yih-Chung
    EBIOMEDICINE, 2023, 95
  • [8] A Performance Evaluation of Large Language Models in Keratoconus: A Comparative Study of ChatGPT-3.5, ChatGPT-4.0, Gemini, Copilot, Chatsonic, and Perplexity
    Reyhan, Ali Hakim
    Mutaf, Cagri
    Uzun, Irfan
    Yuksekyayla, Funda
    JOURNAL OF CLINICAL MEDICINE, 2024, 13 (21)
  • [9] Performance of ChatGPT-3.5, ChatGPT-4, Microsoft Copilot, and Google Bard To Identify Correct Information for Lung Cancer
    Le, Hoa
    Truong, Chi
    PHARMACOEPIDEMIOLOGY AND DRUG SAFETY, 2024, 33 : 347 - 348
  • [10] Performance of ChatGPT-3.5 and ChatGPT-4 in the Taiwan National Pharmacist Licensing Examination: Comparative Evaluation Study
    Wang, Ying-Mei
    Shen, Hung-Wei
    Chen, Tzeng-Ji
    Chiang, Shu-Chiung
    Lin, Ting-Guan
    JMIR MEDICAL EDUCATION, 2025, 11