Can a large language model create acceptable dental board-style examination questions? A cross-sectional prospective study

Cited by: 0
Authors
Kim, Hak-Sun [1]
Kim, Gyu-Tae [2]
Affiliations
[1] Kyung Hee Univ, Dept Oral & Maxillofacial Radiol, Dent Hosp, Seoul, South Korea
[2] Kyung Hee Univ, Coll Dent, Dept Oral & Maxillofacial Surg, 26 Kyungheedae Ro, Seoul 02447, South Korea
Keywords
Dental education; Examination questions; Professional competence; Artificial intelligence; Natural language processing
DOI
10.1016/j.jds.2024.08.020
Chinese Library Classification (CLC)
R78 [Stomatology]
Subject Classification Code
1003
Abstract
Background/purpose: Numerous studies have shown that large language models (LLMs) can score above the passing grade on various board examinations. This study therefore evaluated national dental board-style examination questions created by an LLM against those created by human experts, using item analysis.
Materials and methods: The study was conducted in June 2024 with senior dental students (n = 30) who participated voluntarily. An LLM, ChatGPT 4o, generated 44 national dental board-style examination questions based on textbook content; after false questions were removed, 20 were randomly selected as the LLM set. Two experts created another set of 20 questions based on the same content and in the same style. The students answered all 40 questions, divided into the two sets, in a single classroom session using Google Forms. Responses were analyzed for difficulty index, discrimination index, and distractor efficiency, and the sets were compared using the Wilcoxon signed rank test or the linear-by-linear association test at a 95% confidence level.
Results: The response rate was 100%. The median difficulty indices of the LLM and human sets were 55.00% and 50.00%, respectively, both within the "excellent" range. The median discrimination indices were 0.29 for the LLM set and 0.14 for the human set, and both sets had a median distractor efficiency of 80.00%. None of the differences was statistically significant (P > 0.050).
Conclusion: An LLM can create national board-style examination questions of quality equivalent to those created by human experts.
(c) 2025 Association for Dental Sciences of the Republic of China. Publishing services by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
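The three item-analysis metrics reported in the abstract have standard classical-test-theory definitions: the difficulty index is the percentage of examinees answering an item correctly, the discrimination index is the difference in correct-answer proportions between upper- and lower-scoring groups, and distractor efficiency is the share of distractors that attract at least a minimal fraction of responses. The minimal Python sketch below illustrates these definitions; the item_analysis helper, the 27% group fraction, the 5% non-functional-distractor cutoff, and the five-option format are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter

def item_analysis(item_responses, key, total_scores,
                  options=("A", "B", "C", "D", "E"),
                  group_frac=0.27, nfd_cutoff=0.05):
    """Classical item analysis for one multiple-choice question.

    item_responses : option each examinee chose for this item
    key            : the correct option
    total_scores   : each examinee's total test score, used to form
                     the upper/lower groups for discrimination
    group_frac and nfd_cutoff are conventional values (27% groups,
    5% non-functional-distractor threshold) assumed for illustration.
    """
    n = len(item_responses)

    # Difficulty index: percentage of all examinees answering correctly.
    difficulty = 100.0 * sum(r == key for r in item_responses) / n

    # Discrimination index: proportion correct in the top-scoring group
    # minus proportion correct in the bottom-scoring group.
    k = max(1, round(group_frac * n))
    ranked = sorted(range(n), key=lambda i: total_scores[i])
    low_correct = sum(item_responses[i] == key for i in ranked[:k])
    high_correct = sum(item_responses[i] == key for i in ranked[-k:])
    discrimination = (high_correct - low_correct) / k

    # Distractor efficiency: percentage of distractors that are
    # "functional", i.e. chosen by at least nfd_cutoff of examinees.
    counts = Counter(item_responses)
    distractors = [o for o in options if o != key]
    functional = sum(counts[o] / n >= nfd_cutoff for o in distractors)
    efficiency = 100.0 * functional / len(distractors)

    return difficulty, discrimination, efficiency

# Hypothetical data: 10 examinees, correct answer "B".
responses = ["B", "B", "A", "B", "C", "B", "D", "B", "B", "A"]
scores = [38, 35, 20, 33, 18, 36, 15, 30, 34, 22]
print(item_analysis(responses, "B", scores))  # (60.0, 1.0, 75.0)
```

The set-level paired comparison described in the abstract could then be run on the per-item index pairs with scipy.stats.wilcoxon, which implements the Wilcoxon signed rank test.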
Pages: 895-900
Number of pages: 6