Effectiveness of various general large language models in clinical consensus and case analysis in dental implantology: a comparative study

Cited by: 0
Authors
Wu, Yuepeng [1 ]
Zhang, Yukang [2 ]
Xu, Mei [3 ]
Chen, Jinzhi [4 ]
Xue, Yican [5 ]
Zheng, Yuchen [1 ]
Affiliations
[1] Zhejiang Prov Peoples Hosp, Affiliated Peoples Hosp, Hangzhou Med Coll, Ctr Plast & Reconstruct Surg, Dept Stomatol, Hangzhou, Zhejiang, Peoples R China
[2] Xianju Tradit Chinese Med Hosp, Taizhou, Zhejiang, Peoples R China
[3] Hangzhou Dent Hosp, West Branch, Hangzhou, Zhejiang, Peoples R China
[4] Hohai Univ, Coll Oceanog, Nanjing, Jiangsu, Peoples R China
[5] Hangzhou Med Coll, Hangzhou, Zhejiang, Peoples R China
Keywords
Large language models; Artificial intelligence; Dental implantology; Clinical decision-making; Case analysis; KNOWLEDGE; QUALITY;
DOI
10.1186/s12911-025-02972-2
CLC Number
R-058
Subject Classification Code
Abstract
Background: This study evaluates and compares ChatGPT-4.0, Gemini Pro 1.5 (0801), Claude 3 Opus, and Qwen 2.0 72B in answering dental implant questions. The aim is to help clinicians in underserved areas choose the most suitable large language model (LLM) for their procedures, improving the accessibility of dental care and clinical decision-making.
Methods: Two dental implant specialists, each with over twenty years of clinical experience, evaluated the models. Questions were categorized into simple true/false questions, complex short-answer questions, and real-life case analyses. Performance was measured using precision, recall, and Bayesian inference-based evaluation metrics.
Results: ChatGPT-4 exhibited the most stable and consistent performance on both simple and complex questions. Gemini Pro 1.5 (0801) performed well on simple questions but was less stable on complex tasks. Qwen 2.0 72B provided high-quality answers for specific cases but showed variability. Claude 3 Opus had the lowest performance across the various metrics. Statistical analysis indicated significant differences between models in diagnostic performance but not in treatment planning.
Conclusions: ChatGPT-4 is the most reliable model for handling these medical questions, followed by Gemini Pro 1.5 (0801). Qwen 2.0 72B shows potential but lacks consistency, and Claude 3 Opus performs poorly overall. Combining multiple models is recommended for comprehensive medical decision-making.
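The abstract reports precision and recall as evaluation metrics for the graded answers. As a minimal illustrative sketch only (the record does not describe the authors' actual pipeline), precision and recall for the true/false question set could be computed as below; the function name, grading labels, and example data are assumptions for illustration.

```python
# Illustrative sketch: computing precision and recall for expert-graded
# true/false answers. This is NOT the authors' actual evaluation pipeline;
# the data and names here are hypothetical.

def precision_recall(reference: list[bool], predicted: list[bool]) -> tuple[float, float]:
    """Precision and recall, treating 'True' as the positive class."""
    tp = sum(1 for r, p in zip(reference, predicted) if r and p)      # true positives
    fp = sum(1 for r, p in zip(reference, predicted) if not r and p)  # false positives
    fn = sum(1 for r, p in zip(reference, predicted) if r and not p)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical example: expert answer key vs. one model's answers to 10 items.
answer_key   = [True, False, True, True, False, True, False, True, True, False]
model_output = [True, False, True, False, False, True, True, True, True, False]
p, r = precision_recall(answer_key, model_output)
print(f"precision={p:.2f}, recall={r:.2f}")
```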
Pages: 11