Effectiveness of various general large language models in clinical consensus and case analysis in dental implantology: a comparative study

Authors
Wu, Yuepeng [1]
Zhang, Yukang [2]
Xu, Mei [3]
Chen, Jinzhi [4]
Xue, Yican [5]
Zheng, Yuchen [1]
Affiliations
[1] Zhejiang Prov Peoples Hosp, Affiliated Peoples Hosp, Hangzhou Med Coll, Ctr Plast & Reconstruct Surg, Dept Stomatol, Hangzhou, Zhejiang, Peoples R China
[2] Xianju Tradit Chinese Med Hosp, Taizhou, Zhejiang, Peoples R China
[3] Hangzhou Dent Hosp, West Branch, Hangzhou, Zhejiang, Peoples R China
[4] Hohai Univ, Coll Oceanog, Nanjing, Jiangsu, Peoples R China
[5] Hangzhou Med Coll, Hangzhou, Zhejiang, Peoples R China
Keywords
Large language models; Artificial intelligence; Dental implantology; Clinical decision-making; Case analysis; KNOWLEDGE; QUALITY
DOI
10.1186/s12911-025-02972-2
Chinese Library Classification
R-058
Abstract
Background: This study evaluates and compares ChatGPT-4.0, Gemini Pro 1.5 (0801), Claude 3 Opus, and Qwen 2.0 72B in answering dental implant questions. The aim is to help doctors in underserved areas choose the most suitable large language models (LLMs) for their procedures, improving dental care accessibility and clinical decision-making.
Methods: Two dental implant specialists, each with over twenty years of clinical experience, evaluated the models. Questions were categorized into simple true/false, complex short-answer, and real-life case analyses. Performance was measured using precision, recall, and Bayesian inference-based evaluation metrics.
Results: ChatGPT-4 exhibited the most stable and consistent performance on both simple and complex questions. Gemini Pro 1.5 (0801) performed well on simple questions but was less stable on complex tasks. Qwen 2.0 72B provided high-quality answers for specific cases but showed variability. Claude 3 Opus had the lowest performance across various metrics. Statistical analysis indicated significant differences between models in diagnostic performance but not in treatment planning.
Conclusions: ChatGPT-4 is the most reliable model for handling medical questions, followed by Gemini Pro 1.5 (0801). Qwen 2.0 72B shows potential but lacks consistency, and Claude 3 Opus performs poorly overall. Combining multiple models is recommended for comprehensive medical decision-making.
Pages: 11