Effectiveness of various general large language models in clinical consensus and case analysis in dental implantology: a comparative study

Authors
Wu, Yuepeng [1]
Zhang, Yukang [2]
Xu, Mei [3]
Chen, Jinzhi [4]
Xue, Yican [5]
Zheng, Yuchen [1]
Affiliations
[1] Zhejiang Prov Peoples Hosp, Affiliated Peoples Hosp, Hangzhou Med Coll, Ctr Plast & Reconstruct Surg,Dept Stomatol, Hangzhou, Zhejiang, Peoples R China
[2] Xianju Tradit Chinese Med Hosp, Taizhou, Zhejiang, Peoples R China
[3] Hangzhou Dent Hosp, West Branch, Hangzhou, Zhejiang, Peoples R China
[4] Hohai Univ, Coll Oceanog, Nanjing, Jiangsu, Peoples R China
[5] Hangzhou Med Coll, Hangzhou, Zhejiang, Peoples R China
Keywords
Large language models; Artificial intelligence; Dental implantology; Clinical decision-making; Case analysis; KNOWLEDGE; QUALITY
DOI
10.1186/s12911-025-02972-2
Chinese Library Classification number
R-058
Abstract
Background: This study evaluates and compares ChatGPT-4.0, Gemini Pro 1.5 (0801), Claude 3 Opus, and Qwen 2.0 72B in answering dental implant questions. The aim is to help doctors in underserved areas choose the best large language model (LLM) for their procedures, improving dental care accessibility and clinical decision-making.
Methods: Two dental implant specialists, each with over twenty years of clinical experience, evaluated the models. Questions were categorized into simple true/false questions, complex short-answer questions, and real-life case analyses. Performance was measured using precision, recall, and Bayesian inference-based evaluation metrics.
Results: ChatGPT-4 exhibited the most stable and consistent performance on both simple and complex questions. Gemini Pro 1.5 (0801) performed well on simple questions but was less stable on complex tasks. Qwen 2.0 72B provided high-quality answers for specific cases but showed variability. Claude 3 Opus had the lowest performance across various metrics. Statistical analysis indicated significant differences between models in diagnostic performance but not in treatment planning.
Conclusions: ChatGPT-4 is the most reliable model for handling medical questions, followed by Gemini Pro 1.5 (0801). Qwen 2.0 72B shows potential but lacks consistency, and Claude 3 Opus performs poorly overall. Combining multiple models is recommended for comprehensive medical decision-making.
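As a rough illustration of the precision and recall metrics named in the Methods, the sketch below scores a model's true/false answers against reference answers. It is a minimal example under stated assumptions, not the authors' evaluation code: the function name `precision_recall`, the choice of "true" as the positive class, and the sample answers are all hypothetical, and the Bayesian inference-based metrics are not reproduced here.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall), treating True as the positive class.

    Hypothetical helper for illustration; not taken from the study.
    """
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)        # answered true, reference true
    fp = sum(1 for p, a in pairs if p and not a)    # answered true, reference false
    fn = sum(1 for p, a in pairs if not p and a)    # answered false, reference true
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up answers to six true/false questions (not data from the study).
model_answers = [True, True, False, True, False, False]
reference_answers = [True, False, False, True, True, False]

precision, recall = precision_recall(model_answers, reference_answers)
print(f"precision={precision:.3f}, recall={recall:.3f}")
```

Precision here is the share of the model's "true" answers that match the reference, and recall is the share of reference "true" items the model identified; a graded short-answer or case-analysis task would need a different scoring scheme.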
Pages: 11