Large Language Models in Medical Education: Comparing ChatGPT- to Human-Generated Exam Questions

Cited by: 23
Authors
Laupichler, Matthias Carl [1 ,2 ]
Rother, Johanna Flora [1 ]
Kadow, Ilona C. Grunwald [3 ]
Ahmadi, Seifollah [3 ]
Raupach, Tobias [1 ]
Affiliations
[1] Univ Hosp Bonn, Inst Med Educ, Venusberg Campus 1, D-53127 Bonn, Germany
[2] Univ Bonn, Inst Psychol, Bonn, Germany
[3] Univ Bonn, Inst Physiol 2, Dept Med, Bonn, Germany
DOI: 10.1097/ACM.0000000000005626
Chinese Library Classification: G40 (Education)
Discipline codes: 040101; 120403
Abstract
Problem: Creating medical exam questions is time consuming, but well-written questions can be used for test-enhanced learning, which has been shown to have a positive effect on student learning. The automated generation of high-quality questions by large language models (LLMs), such as ChatGPT, would therefore be desirable. However, no current studies compare students' performance on LLM-generated questions with their performance on questions developed by humans.

Approach: The authors compared student performance on questions generated by ChatGPT (LLM questions) with performance on questions created by medical educators (human questions). Two sets of 25 multiple-choice questions (MCQs) were created, each with 5 answer options, 1 of which was correct. The first set was written by an experienced medical educator; the second was created by ChatGPT 3.5 after the authors identified learning objectives and extracted some specifications from the human questions. Students answered all questions in random order in a formative paper-and-pencil test offered in the run-up to the final summative neurophysiology exam (summer 2023). For each question, students also indicated whether they thought it had been written by a human or by ChatGPT.

Outcomes: The final data set consisted of 161 participants and 46 MCQs (25 human and 21 LLM questions). There was no statistically significant difference in item difficulty between the 2 question sets, but discriminatory power was statistically significantly higher for human than for LLM questions (mean = .36, standard deviation [SD] = .09 vs mean = .24, SD = .14; P = .001). On average, students identified 57% of question sources (human or LLM) correctly.

Next Steps: Future research should replicate the study procedure in other contexts (e.g., other medical subjects, semesters, countries, and languages). In addition, whether LLMs are suitable for generating other question types, such as key feature questions, should be investigated.
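The abstract reports two classical item statistics: item difficulty and discriminatory power. As an illustration only (not the authors' analysis code, whose exact discrimination formula is not stated in the abstract), the sketch below computes item difficulty as the proportion of correct answers and discriminatory power as the corrected item-total correlation, one common operationalization. The function names and the toy score matrix are assumptions for demonstration.

```python
import numpy as np

def item_difficulty(scores: np.ndarray) -> np.ndarray:
    """Item difficulty: proportion of students answering each item correctly.

    `scores` is a students x items matrix of 0/1 values.
    """
    return scores.mean(axis=0)

def discriminatory_power(scores: np.ndarray) -> np.ndarray:
    """Corrected item-total correlation: each item's 0/1 score correlated
    with the total score over the *remaining* items (excluding the item
    itself avoids inflating the correlation)."""
    totals = scores.sum(axis=1)
    n_items = scores.shape[1]
    r = np.empty(n_items)
    for j in range(n_items):
        rest = totals - scores[:, j]  # total score excluding item j
        r[j] = np.corrcoef(scores[:, j], rest)[0, 1]
    return r

# Toy data (assumed): 4 students x 3 items, 1 = correct, 0 = incorrect
scores = np.array([
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
])
print(item_difficulty(scores))      # [0.75 0.5  0.25]
print(discriminatory_power(scores))
```

Values near the reported means (.36 for human, .24 for LLM questions) would indicate that an item moderately separates stronger from weaker students; values near zero indicate a poorly discriminating item.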
Pages: 508-512 (5 pages)