Evidence-based potential of generative artificial intelligence large language models in orthodontics: a comparative study of ChatGPT, Google Bard, and Microsoft Bing

Cited by: 15
Authors
Makrygiannakis, Miltiadis A. [1 ,2 ,5 ]
Giannakopoulos, Kostis [2 ]
Kaklamanos, Eleftherios G. [2 ,3 ,4 ]
Affiliations
[1] Natl & Kapodistrian Univ Athens, Sch Dent, Athens 11527, Greece
[2] European Univ Cyprus, Sch Dent, CY-2404 Nicosia, Cyprus
[3] Aristotle Univ Thessaloniki, Sch Dent, Thessaloniki 54124, Greece
[4] Mohammed Bin Rashid Univ Med & Hlth Sci MBRU, Hamdan Bin Mohammed Coll Dent Med, Dubai, U Arab Emirates
[5] Natl & Kapodistrian Univ Athens, Sch Dent, 2 Thivon St, Athens 11527, Greece
Keywords
orthodontics; large language models; ChatGPT; Google Bard; Microsoft Bing Chat
DOI
10.1093/ejo/cjae017
Chinese Library Classification
R78 [Stomatology]
Discipline Classification Code
1003
Abstract
Background: The increasing utilization of large language models (LLMs) in generative artificial intelligence across various medical and dental fields, and specifically in orthodontics, raises questions about their accuracy.

Objective: This study aimed to assess and compare the answers offered by four LLMs, Google's Bard, OpenAI's ChatGPT-3.5 and ChatGPT-4, and Microsoft's Bing Chat, in response to clinically relevant questions within the field of orthodontics.

Materials and methods: Ten open-type clinical orthodontics-related questions were posed to the LLMs. The responses were assessed on a scale from 0 (minimum) to 10 (maximum) points, benchmarked against robust scientific evidence, including consensus statements and systematic reviews, using a predefined rubric. Four weeks after the initial evaluation, the answers were re-evaluated to gauge intra-evaluator reliability. Scores were compared statistically using Friedman's and Wilcoxon's tests to identify the model providing the most comprehensive, scientifically accurate, clear, and relevant answers.

Results: Overall, no statistically significant differences were detected between the scores given by the two evaluators on either scoring occasion, so an average score was computed for each LLM. Microsoft Bing Chat scored highest (average score = 7.1), followed by ChatGPT-4 (4.7), Google Bard (4.6), and ChatGPT-3.5 (3.8). Although Microsoft Bing Chat statistically outperformed ChatGPT-3.5 (P-value = 0.017) and Google Bard (P-value = 0.029), and ChatGPT-4 likewise outperformed ChatGPT-3.5 (P-value = 0.011), all models occasionally produced answers lacking comprehensiveness, scientific accuracy, clarity, or relevance.

Limitations: The questions asked were indicative and did not cover the entire field of orthodontics.

Conclusions: Large language models (LLMs) show great potential in supporting evidence-based orthodontics; however, their current limitations pose a risk of incorrect healthcare decisions if they are used without careful consideration. Consequently, these tools cannot substitute for the orthodontist's essential critical thinking and comprehensive subject knowledge. Further research, clinical validation, and model enhancements are needed before effective integration into practice. Clinicians must be mindful of the limitations of LLMs, as imprudent use could adversely affect patient care.
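As a rough illustration of the analysis described in the Materials and methods, the sketch below runs a Friedman test over per-question scores for the four models and then pairwise Wilcoxon signed-rank tests between models. The scipy functions are standard, but the score arrays are hypothetical placeholders for illustration only; this is not the authors' code or data.

```python
# Minimal sketch (assumed workflow, not the study's actual analysis):
# Friedman test across the four LLMs' per-question scores, followed by
# pairwise Wilcoxon signed-rank tests. Scores below are placeholders.
from itertools import combinations

from scipy.stats import friedmanchisquare, wilcoxon

# Hypothetical 0-10 scores for the ten questions, one list per model.
scores = {
    "Microsoft Bing Chat": [8, 7, 6, 8, 7, 7, 6, 8, 7, 7],
    "ChatGPT-4":           [5, 4, 6, 5, 4, 5, 4, 6, 5, 4],
    "Google Bard":         [5, 4, 5, 4, 5, 4, 5, 4, 5, 5],
    "ChatGPT-3.5":         [4, 3, 4, 4, 3, 4, 4, 4, 4, 4],
}

# Friedman test: is there an overall difference among the related samples?
stat, p = friedmanchisquare(*scores.values())
print(f"Friedman chi-square = {stat:.2f}, p = {p:.4f}")

# Post-hoc pairwise Wilcoxon signed-rank tests, paired by question.
for (name_a, a), (name_b, b) in combinations(scores.items(), 2):
    w_stat, w_p = wilcoxon(a, b)
    print(f"{name_a} vs {name_b}: W = {w_stat:.1f}, p = {w_p:.4f}")
```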
Pages: 7