Effectiveness of various general large language models in clinical consensus and case analysis in dental implantology: a comparative study

Cited: 0
Authors
Wu, Yuepeng [1 ]
Zhang, Yukang [2 ]
Xu, Mei [3 ]
Chen, Jinzhi [4 ]
Xue, Yican [5 ]
Zheng, Yuchen [1 ]
Affiliations
[1] Zhejiang Prov Peoples Hosp, Affiliated Peoples Hosp, Hangzhou Med Coll, Ctr Plast & Reconstruct Surg,Dept Stomatol, Hangzhou, Zhejiang, Peoples R China
[2] Xianju Tradit Chinese Med Hosp, Taizhou, Zhejiang, Peoples R China
[3] Hangzhou Dent Hosp, West Branch, Hangzhou, Zhejiang, Peoples R China
[4] Hohai Univ, Coll Oceanog, Nanjing, Jiangsu, Peoples R China
[5] Hangzhou Med Coll, Hangzhou, Zhejiang, Peoples R China
Keywords
Large language models; Artificial intelligence; Dental implantology; Clinical decision-making; Case analysis; KNOWLEDGE; QUALITY;
DOI
10.1186/s12911-025-02972-2
Chinese Library Classification
R-058
Abstract
Background: This study evaluates and compares ChatGPT-4.0, Gemini Pro 1.5 (0801), Claude 3 Opus, and Qwen 2.0 72B in answering dental implant questions. The aim is to help clinicians in underserved areas choose the most suitable large language model (LLM) for their procedures, improving dental care accessibility and clinical decision-making.
Methods: Two dental implant specialists, each with over twenty years of clinical experience, evaluated the models. Questions were categorized into simple true/false questions, complex short-answer questions, and real-life case analyses. Performance was measured using precision, recall, and Bayesian inference-based evaluation metrics.
Results: ChatGPT-4.0 exhibited the most stable and consistent performance on both simple and complex questions. Gemini Pro 1.5 (0801) performed well on simple questions but was less stable on complex tasks. Qwen 2.0 72B provided high-quality answers for specific cases but showed variability. Claude 3 Opus had the lowest performance across various metrics. Statistical analysis indicated significant differences between the models in diagnostic performance but not in treatment planning.
Conclusions: ChatGPT-4.0 is the most reliable model for handling medical questions, followed by Gemini Pro 1.5 (0801). Qwen 2.0 72B shows potential but lacks consistency, and Claude 3 Opus performs poorly overall. Combining multiple models is recommended for comprehensive medical decision-making.
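The precision and recall metrics named in the Methods section can be sketched as follows. This is an illustrative computation from raw answer counts only; the study's actual grading rubric and Bayesian evaluation pipeline are not described here, and the counts below are hypothetical.

```python
def precision_recall(true_positive: int, false_positive: int, false_negative: int):
    """Standard precision and recall computed from raw answer counts."""
    # Precision: of the answers the model asserted, how many were correct.
    precision = true_positive / (true_positive + false_positive)
    # Recall: of the correct answers available, how many the model produced.
    recall = true_positive / (true_positive + false_negative)
    return precision, recall

# Hypothetical tally for one model on a true/false question set:
# 40 correct assertions, 5 incorrect assertions, 5 missed items.
p, r = precision_recall(40, 5, 5)
print(f"precision={p:.2f}, recall={r:.2f}")  # precision=0.89, recall=0.89
```

On such a balanced tally precision and recall coincide; in the study's setting they would diverge whenever a model over- or under-asserts answers.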
Pages: 11
Related papers
50 results
  • [41] Using Large Language Models to Support Content Analysis: A Case Study of ChatGPT for Adverse Event Detection
    Leas, Eric C.
    Ayers, John W.
    Desai, Nimit
    Dredze, Mark
    Hogarth, Michael
    Smith, Davey M.
    JOURNAL OF MEDICAL INTERNET RESEARCH, 2024, 26
  • [42] Using large language models (ChatGPT, Copilot, PaLM, Bard, and Gemini) in Gross Anatomy course: Comparative analysis
    Mavrych, Volodymyr
    Ganguly, Paul
    Bolgova, Olena
    CLINICAL ANATOMY, 2025, 38 (02) : 200 - 210
  • [43] Large language models in pathology: A comparative study of ChatGPT and Bard with pathology trainees on multiple-choice questions
    Du, Wei
    Jin, Xueting
    Harris, Jaryse Carol
    Brunetti, Alessandro
    Johnson, Erika
    Leung, Olivia
    Li, Xingchen
    Walle, Selemon
    Yu, Qing
    Zhou, Xiao
    Bian, Fang
    Mckenzie, Kajanna
    Kanathanavanich, Manita
    Ozcelik, Yusuf
    El-Sharkawy, Farah
    Koga, Shunsuke
    ANNALS OF DIAGNOSTIC PATHOLOGY, 2024, 73
  • [44] Do Large Language Models Produce Diverse Design Concepts? A Comparative Study with Human-Crowdsourced Solutions
    Ma, Kevin
    Grandi, Daniele
    Mccomb, Christopher
    Goucher-Lambert, Kosa
    JOURNAL OF COMPUTING AND INFORMATION SCIENCE IN ENGINEERING, 2025, 25 (02)
  • [45] A Comparative Study: Can Large Language Models Beat Radiologists on PI-RADSv2.1-Related Questions?
    Eren, Camur
    Turay, Cesur
    Celal, Guenes Yasin
    JOURNAL OF MEDICAL AND BIOLOGICAL ENGINEERING, 2024, 44 (06) : 821 - 830
  • [46] Adopting Pre-trained Large Language Models for Regional Language Tasks: A Case Study
    Gaikwad, Harsha
    Kiwelekar, Arvind
    Laddha, Manjushree
    Shahare, Shashank
    INTELLIGENT HUMAN COMPUTER INTERACTION, IHCI 2023, PT I, 2024, 14531 : 15 - 25
  • [47] Comparative Analysis of Large Language Models in Structured Information Extraction from Job Postings
    Sioziou, Kyriaki
    Zervas, Panagiotis
    Giotopoulos, Kostas
    Tzimas, Giannis
    ENGINEERING APPLICATIONS OF NEURAL NETWORKS, EANN 2024, 2024, 2141 : 82 - 92
  • [48] Prompting or Fine-tuning? A Comparative Study of Large Language Models for Taxonomy Construction
    Chen, Boqi
    Yi, Fandi
    Varro, Daniel
    2023 ACM/IEEE INTERNATIONAL CONFERENCE ON MODEL DRIVEN ENGINEERING LANGUAGES AND SYSTEMS COMPANION, MODELS-C, 2023, : 588 - 596
  • [49] Decoding Logic Errors: A Comparative Study on Bug Detection by Students and Large Language Models
    MacNeil, Stephen
    Denny, Paul
    Tran, Andrew
    Leinonen, Juho
    Bernstein, Seth
    Hellas, Arto
    Sarsa, Sami
    Kim, Joanne
    PROCEEDINGS OF THE 26TH AUSTRALASIAN COMPUTING EDUCATION CONFERENCE, ACE 2024, 2024, : 11 - 18
  • [50] Unveiling the Impact of Large Language Models on Student Learning: A Comprehensive Case Study
    Zdravkova, Katerina
    Dalipi, Fisnik
    Ahlgren, Fredrik
    Ilijoski, Bojan
    Ohlsson, Tobias
    2024 IEEE GLOBAL ENGINEERING EDUCATION CONFERENCE, EDUCON 2024, 2024,