Performance of three artificial intelligence (AI)-based large language models in standardized testing; implications for AI-assisted dental education

Cited by: 13
Authors
Sabri, Hamoun [1 ,2 ]
Saleh, Muhammad H. A. [1 ]
Hazrati, Parham [1 ]
Merchant, Keith [3 ]
Misch, Jonathan [1 ]
Kumar, Purnima S. [1 ]
Wang, Hom-Lay [1 ]
Barootchi, Shayan [1 ,2 ,4 ]
Affiliations
[1] Univ Michigan, Sch Dent, Dept Periodont & Oral Med, 1011 N Univ Ave, Ann Arbor, MI 48109 USA
[2] Ctr Clin Res & Evidence Synth Oral Tissue Regenera, Ann Arbor, MI USA
[3] Naval Postgrad Dent Sch, Bethesda, MD USA
[4] Harvard Sch Dent Med, Dept Oral Med Infect & Immun, Div Periodontol, Boston, MA 02115 USA
Keywords
American Academy of Periodontology; artificial intelligence; ChatGPT; ChatGPT-3.5; ChatGPT-4; dental education; Gemini; Google Bard; Google Gemini; periodontal education; CHATGPT; RISKS;
DOI
10.1111/jre.13323
Chinese Library Classification (CLC)
R78 [Stomatology];
Discipline Code
1003;
Abstract
Introduction: The rise of novel computer technologies and automated data analytics has the potential to change the course of dental education. In line with our long-term goal of harnessing the power of AI to augment didactic teaching, the objective of this study was to quantify and compare the accuracy of responses to the annual in-service examination questions of the American Academy of Periodontology (AAP) given by ChatGPT (GPT-4 and GPT-3.5) and Google Gemini, the three primary large language models (LLMs), against those of human graduate students (control group).
Methods: Under a comparative cross-sectional study design, a corpus of 1312 questions from the annual AAP in-service examinations administered between 2020 and 2023 was presented to the LLMs. Their responses were analyzed using chi-square tests, and their performance was compared with the scores of periodontal residents from the corresponding years, who served as the human control group. Two sub-analyses were also performed: one on the performance of the LLMs on each section of the exam, and one on their answers to the most difficult questions.
Results: ChatGPT-4 (total average: 79.57%) outperformed all human control groups as well as GPT-3.5 and Google Gemini in all exam years (p < .001), with accuracy ranging from 78.80% to 80.98% across the exam years. Gemini consistently outperformed ChatGPT-3.5, scoring 70.65% (p = .01), 73.29% (p = .02), 75.73% (p < .01), and 72.18% (p = .0008) on the 2020-2023 exams, versus 62.5%, 68.24%, 69.83%, and 59.27%, respectively. With all exam years combined, Google Gemini (72.86%) surpassed the average scores of first-year (63.48% ± 31.67) and second-year residents (66.25% ± 31.61), but not that of third-year residents (69.06% ± 30.45).
Conclusions: Within the confines of this analysis, ChatGPT-4 exhibited a robust capability in answering AAP in-service exam questions in terms of accuracy and reliability, whereas Gemini and ChatGPT-3.5 performed more weakly. These findings underscore the potential of deploying LLMs as an educational tool in the domains of periodontics and oral implantology. However, the current limitations of these models should be considered, including their inability to process image-based questions, their propensity for generating inconsistent responses to the same prompts, and their high (80% for GPT-4) but not absolute accuracy. An objective comparison of their capability versus their capacity is required to further develop this field of study.
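The chi-square comparison described in Methods can be sketched as follows. This is a minimal illustration of a Pearson chi-square test of independence on a 2×2 table of correct/incorrect counts; the counts below are hypothetical, back-calculated from the reported average accuracies over the 1312 questions, and are not the study's actual per-question data.

```python
import math

def chi2_2x2(a_correct, a_wrong, b_correct, b_wrong):
    """Pearson chi-square test of independence on a 2x2 table (df = 1)."""
    table = [[a_correct, a_wrong], [b_correct, b_wrong]]
    total = a_correct + a_wrong + b_correct + b_wrong
    row = [sum(r) for r in table]                       # per-model totals
    col = [a_correct + b_correct, a_wrong + b_wrong]    # correct/wrong totals
    # Sum of (observed - expected)^2 / expected over the four cells.
    chi2 = sum(
        (table[i][j] - row[i] * col[j] / total) ** 2 / (row[i] * col[j] / total)
        for i in range(2)
        for j in range(2)
    )
    # Survival function of the chi-square distribution with 1 df.
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Hypothetical counts approximating the reported averages:
# GPT-4 ~79.57% of 1312 correct vs. GPT-3.5 ~65% correct.
gpt4_counts = (1044, 268)
gpt35_counts = (852, 460)
chi2, p = chi2_2x2(*gpt4_counts, *gpt35_counts)
```

With these illustrative counts the difference in accuracy is highly significant (p far below .001), consistent in direction with the reported result; the study itself presumably used per-year contingency tables rather than this pooled sketch.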
Pages: 121-133 (13 pages)