Comparative Analysis of the Response Accuracies of Large Language Models in the Korean National Dental Hygienist Examination Across Korean and English Questions

Cited by: 2
Authors
Song, Eun Sun [1 ]
Lee, Seung-Pyo [1 ]
Affiliation
[1] Seoul Natl Univ, Sch Dent, Dent Res Inst, Dept Oral Anat, Seoul, South Korea
Keywords
artificial intelligence; ChatGPT; dental hygienist; Gemini; large language models; licensing examination
DOI
10.1111/idh.12848
Chinese Library Classification (CLC): R78 [Stomatology]
Subject Classification Code: 1003
Abstract
Introduction: Large language models such as Gemini, GPT-3.5, and GPT-4 have demonstrated significant potential in the medical field. Their performance on medical licensing examinations worldwide has highlighted their ability to understand and process specialized medical knowledge. This study aimed to evaluate and compare the performance of Gemini, GPT-3.5, and GPT-4 on the Korean National Dental Hygienist Examination, assessing the accuracy of their answers to examination questions in both Korean and English.

Methods: This study used a dataset comprising questions from the Korean National Dental Hygienist Examination over five years (2019-2023). Questions were input into each model under standardized conditions, and responses were classified as correct or incorrect based on predefined criteria. A two-way analysis of variance (ANOVA) was employed to investigate the effects of model type and language on response accuracy.

Results: GPT-4 consistently outperformed the other models, achieving the highest accuracy rates in both language versions each year. In particular, it showed superior performance in English, suggesting advances in its training algorithms for language processing. However, all models showed variable accuracy in subjects with localized characteristics, such as health and medical law.

Conclusions: These findings indicate that GPT-4 holds significant promise for application in medical education and standardized testing, especially in English. However, the variability in performance across subjects and languages underscores the need for ongoing improvement and the inclusion of more diverse, localized training data to enhance the models' effectiveness in multilingual and multicultural contexts.
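The two-way ANOVA described in the Methods section can be sketched as follows for a balanced design (model type × question language, with exam years as replicates). All numbers below are illustrative placeholders, not the study's actual data, and the manual sums-of-squares computation is one standard way to perform the test, not necessarily the authors' exact procedure.

```python
import numpy as np
from itertools import product
from scipy import stats

# Hypothetical balanced design: 3 models x 2 languages x 5 exam years,
# each cell holding a per-year accuracy (%). Illustrative numbers only.
cells = {
    ("GPT-4",   "Korean"):  [80, 82, 79, 84, 81],
    ("GPT-4",   "English"): [86, 88, 85, 89, 87],
    ("GPT-3.5", "Korean"):  [58, 60, 57, 61, 59],
    ("GPT-3.5", "English"): [62, 64, 61, 65, 63],
    ("Gemini",  "Korean"):  [55, 57, 54, 58, 56],
    ("Gemini",  "English"): [59, 61, 58, 62, 60],
}
models = ["GPT-4", "GPT-3.5", "Gemini"]
langs = ["Korean", "English"]
n = 5                                  # replicates (years) per cell
a, b = len(models), len(langs)

y = np.array([cells[m, l] for m, l in product(models, langs)], float)
y = y.reshape(a, b, n)
grand = y.mean()

# Marginal and cell means for the balanced two-way layout.
mean_model = y.mean(axis=(1, 2))       # one mean per model
mean_lang = y.mean(axis=(0, 2))        # one mean per language
mean_cell = y.mean(axis=2)             # model x language cell means

# Sums of squares for main effects, interaction, and error.
ss_model = b * n * ((mean_model - grand) ** 2).sum()
ss_lang = a * n * ((mean_lang - grand) ** 2).sum()
ss_inter = n * ((mean_cell - mean_model[:, None]
                 - mean_lang[None, :] + grand) ** 2).sum()
ss_error = ((y - mean_cell[:, :, None]) ** 2).sum()

df_model, df_lang = a - 1, b - 1
df_inter = df_model * df_lang
df_error = a * b * (n - 1)

# F statistic and p-value for the model-type main effect.
ms_error = ss_error / df_error
f_model = (ss_model / df_model) / ms_error
p_model = stats.f.sf(f_model, df_model, df_error)
print(f"model effect: F({df_model}, {df_error}) = {f_model:.2f}, p = {p_model:.3g}")
```

The same F-ratio construction applies to the language main effect (`ss_lang / df_lang` over `ms_error`) and the interaction term; a small p-value for the interaction would indicate that the language advantage differs by model.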
Pages: 267-276 (10 pages)