Comparative accuracy of artificial intelligence chatbots in pulpal and periradicular diagnosis: A cross-sectional study

Cited by: 2
Authors
Mendonça de Moura, João Daniel [1]
Fontana, Carlos Eduardo [2]
Reis da Silva Lima, Vitor Henrique [3]
de Souza Alves, Iris [4]
André de Melo Santos, Paulo [1]
de Almeida Rodrigues, Patrícia [1]
Affiliations
[1] Postgraduate Program in Clinical Dentistry, University Center of Pará (CESUPA), Pará, Belém
[2] Center for Health Sciences, Pontifical Catholic University of Campinas (PUC-Campinas), Postgraduate Program in Health Sciences, Campinas, São Paulo
[3] Endodontics Specialization Program, University Center of Pará (CESUPA), Pará, Belém
[4] Dentistry Program, University Center of Pará (CESUPA), Pará, Belém
Source
Computers in Biology and Medicine, Volume 183
Keywords
Artificial intelligence; Dental pulp diseases; Diagnosis; Endodontics; Machine learning
DOI
10.1016/j.compbiomed.2024.109332
Abstract
Objectives: This study aimed to evaluate the diagnostic accuracy and treatment recommendation performance of four artificial intelligence chatbots in fictional pulpal and periradicular disease cases. Additionally, it investigated response consistency and the influence of text order and language on chatbot performance.
Methods: In this cross-sectional comparative study, eleven cases representing various pulpal and periradicular pathologies were created. These cases were presented to four chatbots (ChatGPT 3.5, ChatGPT 4.0, Bard, and Bing) in both Portuguese and English, with the information order varied (signs and symptoms first or imaging data first). Statistical analyses included the Kruskal-Wallis test, Dwass-Steel-Critchlow-Fligner pairwise comparisons, simple logistic regression, and the binomial test.
Results: Bing and ChatGPT 4.0 achieved the highest diagnostic accuracy rates (86.4 % and 85.3 % respectively), significantly outperforming ChatGPT 3.5 (46.5 %) and Bard (28.6 %) (p < 0.001). For treatment recommendations, ChatGPT 4.0, Bing, and ChatGPT 3.5 performed similarly (94.4 %, 93.2 %, and 86.3 %, respectively), while Bard exhibited significantly lower accuracy (75 %, p < 0.001). No significant association between diagnosis and treatment accuracy was found for Bard and Bing, but a positive association was observed for ChatGPT 3.5 and ChatGPT 4.0 (p < 0.05). The overall consistency rate was 98.29 %, with no significant differences related to text order or language. Cases presented in Portuguese prompted significantly more additional information requests than those in English (33.5 % vs. 10.2 %; p < 0.001), with the relevance of this information being higher in Portuguese (29.5 % vs. 8.5 %; p < 0.001).
Conclusions: Bing and ChatGPT 4.0 demonstrated superior diagnostic accuracy, while Bard showed the lowest accuracy in both diagnosis and treatment recommendations. However, the clinical application of these tools necessitates critical interpretation by dentists, as chatbot responses are not consistently reliable. © 2024 Elsevier Ltd