Evidence-based potential of generative artificial intelligence large language models in orthodontics: a comparative study of ChatGPT, Google Bard, and Microsoft Bing

Cited by: 15
Authors
Makrygiannakis, Miltiadis A. [1, 2, 5]
Giannakopoulos, Kostis [2]
Kaklamanos, Eleftherios G. [2, 3, 4]
Affiliations
[1] Natl & Kapodistrian Univ Athens, Sch Dent, Athens 11527, Greece
[2] European Univ Cyprus, Sch Dent, CY-2404 Nicosia, Cyprus
[3] Aristotle Univ Thessaloniki, Sch Dent, Thessaloniki 54124, Greece
[4] Mohammed Bin Rashid Univ Med & Hlth Sci MBRU, Hamdan Bin Mohammed Coll Dent Med, Dubai, U Arab Emirates
[5] Natl & Kapodistrian Univ Athens, Sch Dent, 2 Thivon St, Athens 11527, Greece
Keywords
orthodontics; large language models; ChatGPT; Google Bard; Microsoft Bing Chat
DOI
10.1093/ejo/cjae017
Chinese Library Classification (CLC) number
R78 [Stomatology]
Discipline classification code
1003
Abstract
Background: The increasing utilization of large language models (LLMs) in Generative Artificial Intelligence across various medical and dental fields, and specifically orthodontics, raises questions about their accuracy.

Objective: This study aimed to assess and compare the answers offered by four LLMs (Google's Bard, OpenAI's ChatGPT-3.5 and ChatGPT-4, and Microsoft's Bing) in response to clinically relevant questions within the field of orthodontics.

Materials and methods: Ten open-type clinical orthodontics-related questions were posed to the LLMs. The responses were scored on a scale ranging from 0 (minimum) to 10 (maximum) points, benchmarked against robust scientific evidence, including consensus statements and systematic reviews, using a predefined rubric. Four weeks after the initial evaluation, the answers were reevaluated to gauge intra-evaluator reliability. The scores were compared statistically using Friedman's and Wilcoxon's tests to identify the model whose answers were the most comprehensive, scientifically accurate, clear, and relevant.

Results: Overall, no statistically significant differences were detected between the scores given by the two evaluators on either scoring occasion, so an average score was computed for every LLM. The highest-scoring answers were those of Microsoft Bing Chat (average score = 7.1), followed by ChatGPT-4 (average score = 4.7), Google Bard (average score = 4.6), and finally ChatGPT-3.5 (average score = 3.8). Although Microsoft Bing Chat statistically outperformed ChatGPT-3.5 (P-value = 0.017) and Google Bard (P-value = 0.029), and ChatGPT-4 outperformed ChatGPT-3.5 (P-value = 0.011), all models occasionally produced answers lacking comprehensiveness, scientific accuracy, clarity, and relevance.

Limitations: The questions asked were indicative and did not cover the entire field of orthodontics.

Conclusions: Large language models (LLMs) show great potential in supporting evidence-based orthodontics. However, their current limitations pose a potential risk of incorrect healthcare decisions if they are used without careful consideration. Consequently, these tools cannot substitute for the orthodontist's essential critical thinking and comprehensive subject knowledge. For effective integration into practice, further research, clinical validation, and enhancements to the models are essential. Clinicians must be mindful of the limitations of LLMs, as their imprudent utilization could have adverse effects on patient care.
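
The abstract names Friedman's test as the omnibus comparison across the four models. As a rough illustration only, the following pure-Python sketch computes the Friedman chi-square statistic for per-question scores. The scores, the model labels (`model_a` to `model_d`), and the helper names are hypothetical and are not taken from the study; the sketch omits the tie correction and the P-value, and the abstract does not say which software the authors used.

```python
from typing import Dict, List


def average_ranks(values: List[float]) -> List[float]:
    """Rank values ascending (1 = lowest); tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for i in order[pos:end + 1]:
            ranks[i] = mean_rank
        pos = end + 1
    return ranks


def friedman_statistic(scores: Dict[str, List[float]]) -> float:
    """Friedman chi-square statistic (no tie correction).

    `scores` maps each model to its scores on the same ordered questions.
    """
    models = list(scores)
    k = len(models)              # number of models (treatments)
    n = len(scores[models[0]])   # number of questions (blocks)
    rank_sums = [0.0] * k
    for q in range(n):
        row = [scores[m][q] for m in models]
        for j, rank in enumerate(average_ranks(row)):
            rank_sums[j] += rank
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)


if __name__ == "__main__":
    # Hypothetical scores on a 0-10 scale for five questions; NOT study data.
    hypothetical_scores = {
        "model_a": [7.0, 8.0, 6.5, 7.5, 9.0],
        "model_b": [4.0, 5.5, 3.0, 6.0, 5.0],
        "model_c": [5.0, 4.5, 4.0, 5.5, 4.5],
        "model_d": [3.0, 4.0, 2.5, 3.5, 4.0],
    }
    print(friedman_statistic(hypothetical_scores))
```

In practice, a statistics library would normally be used instead: SciPy, for example, provides `scipy.stats.friedmanchisquare` for the omnibus test and `scipy.stats.wilcoxon` for the pairwise signed-rank comparisons, both of which return a P-value.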
Pages: 7
Related papers
24 in total
  • [1] Evaluation of the Performance of Generative AI Large Language Models ChatGPT, Google Bard, and Microsoft Bing Chat in Supporting Evidence-Based Dentistry: Comparative Mixed Methods Study
    Giannakopoulos, Kostis
    Kavadella, Argyro
    Salim, Anas Aaqel
    Stamatopoulos, Vassilis
    Kaklamanos, Eleftherios G.
    JOURNAL OF MEDICAL INTERNET RESEARCH, 2023, 25
  • [2] Large Language Models in Hematology Case Solving: A Comparative Study of ChatGPT-3.5, Google Bard, and Microsoft Bing
    Kumari, Amita
    Kumari, Anita
    Singh, Amita
    Singh, Sanjeet K.
    Juhi, Ayesha
    Dhanvijay, Anup Kumar D.
    Pinjar, Mohammed Jaffer
    Mondal, Himel
    CUREUS JOURNAL OF MEDICAL SCIENCE, 2023, 15 (08)
  • [3] Evidence-Based Potential of Generative Artificial Intelligence Large Language Models on Dental Avulsion: ChatGPT Versus Gemini
    Kaplan, Taibe Tokgoz
    Cankar, Muhammet
    DENTAL TRAUMATOLOGY, 2025, 41 (02): 178-186
  • [4] Evaluating the evidence-based potential of six large language models in paediatric dentistry: a comparative study on generative artificial intelligence
    Dermata, Anastasia
    Arhakis, Aristidis
    Makrygiannakis, Miltiadis A.
    Giannakopoulos, Kostis
    Kaklamanos, Eleftherios G.
    EUROPEAN ARCHIVES OF PAEDIATRIC DENTISTRY, 2025
  • [5] Performance of Large Language Models (ChatGPT, Bing Search, and Google Bard) in Solving Case Vignettes in Physiology
    Dhanvijay, Anup Kumar D.
    Pinjar, Mohammed Jaffer
    Dhokane, Nitin
    Sorte, Smita R.
    Kumari, Amita
    Mondal, Himel
    CUREUS JOURNAL OF MEDICAL SCIENCE, 2023, 15 (08)
  • [6] A Generative Artificial Intelligence Using Multilingual Large Language Models for ChatGPT Applications
    Tuan, Nguyen Trung
    Moore, Philip
    Thanh, Dat Ha Vu
    Pham, Hai Van
    APPLIED SCIENCES-BASEL, 2024, 14 (07)
  • [7] Benchmarking large language models' performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard
    Lim, Zhi Wei
    Pushpanathan, Krithi
    Yew, Samantha Min Er
    Lai, Yien
    Sun, Chen-Hsin
    Lam, Janice Sing Harn
    Chen, David Ziyou
    Goh, Jocelyn Hui Lin
    Tan, Marcus Chun Jin
    Sheng, Bin
    Cheng, Ching-Yu
    Koh, Victor Teck Chang
    Tham, Yih-Chung
    EBIOMEDICINE, 2023, 95
  • [8] Artificial Intelligence in Academic Translation: A Comparative Study of Large Language Models and Google Translate
    Mohsen, Mohammed Ali
    PSYCHOLINGUISTICS, 2024, 35 (02): 134-156
  • [9] Generative Artificial Intelligence Through ChatGPT and Other Large Language Models in Ophthalmology Clinical Applications and Challenges
    Tan, Ting Fang
    Thirunavukarasu, Arun James
    Campbell, J. Peter
    Keane, Pearse A.
    Pasquale, Louis R.
    Abramoff, Michael D.
    Kalpathy-Cramer, Jayashree
    Lum, Flora
    Kim, Judy E.
    Baxter, Sally L.
    Ting, Daniel Shu Wei
    OPHTHALMOLOGY SCIENCE, 2023, 3 (04)
  • [10] Large language models in pathology: A comparative study of ChatGPT and Bard with pathology trainees on multiple-choice questions
    Du, Wei
    Jin, Xueting
    Harris, Jaryse Carol
    Brunetti, Alessandro
    Johnson, Erika
    Leung, Olivia
    Li, Xingchen
    Walle, Selemon
    Yu, Qing
    Zhou, Xiao
    Bian, Fang
    Mckenzie, Kajanna
    Kanathanavanich, Manita
    Ozcelik, Yusuf
    El-Sharkawy, Farah
    Koga, Shunsuke
    ANNALS OF DIAGNOSTIC PATHOLOGY, 2024, 73