Assessing ChatGPT's Mastery of Bloom's Taxonomy Using Psychosomatic Medicine Exam Questions: Mixed-Methods Study

Cited by: 16
Authors
Herrmann-Werner, Anne [1,2]
Festl-Wietek, Teresa [1]
Holderried, Friederike [1,3]
Herschbach, Lea [1]
Griewatz, Jan [1]
Masters, Ken [4]
Zipfel, Stephan [2]
Mahling, Moritz [1,5]
Affiliations
[1] Univ Tubingen, Tubingen Inst Med Educ, Fac Med, Elfriede Aulhorn Str 10, D-72076 Tubingen, Germany
[2] Univ Hosp Tubingen, Dept Psychosomat Med & Psychotherapy, Tubingen, Germany
[3] Univ Hosp Tubingen, Univ Dept Anesthesiol & Intens Care Med, Tubingen, Germany
[4] Sultan Qaboos Univ, Coll Med & Hlth Sci, Med Educ & Informat Dept, Muscat, Oman
[5] Univ Hosp Tubingen, Dept Diabetol Endocrinol Nephrol, Sect Nephrol & Hypertens, Tubingen, Germany
Keywords
answer; artificial intelligence; assessment; Bloom's taxonomy; ChatGPT; classification; error; exam; examination; generative; GPT-4; Generative Pre-trained Transformer 4; language model; learning outcome; LLM; MCQ; medical education; medical exam; multiple-choice question; natural language processing; NLP; psychosomatic; question; response; taxonomy; EDUCATION;
DOI
10.2196/52113
Chinese Library Classification
R19 [Health care organization and services (health services administration)]
Abstract
Background: Large language models such as GPT-4 (Generative Pre-trained Transformer 4) are being increasingly used in medicine and medical education. However, these models are prone to "hallucinations" (ie, outputs that seem convincing while being factually incorrect). It is currently unknown how these errors by large language models relate to the different cognitive levels defined in Bloom's taxonomy.
Objective: This study aims to explore how GPT-4 performs in terms of Bloom's taxonomy using psychosomatic medicine exam questions.
Methods: We used a large data set of psychosomatic medicine multiple-choice questions (N=307) with real-world results derived from medical school exams. GPT-4 answered the multiple-choice questions using 2 distinct prompt versions: detailed and short. The answers were analyzed using a quantitative approach and a qualitative approach. Focusing on incorrectly answered questions, we categorized reasoning errors according to the hierarchical framework of Bloom's taxonomy.
Results: GPT-4 achieved a high success rate in answering exam questions: 93% (284/307) for the detailed prompt and 91% (278/307) for the short prompt. Questions answered correctly by GPT-4 had a statistically significantly higher difficulty than questions answered incorrectly (P=.002 for the detailed prompt and P<.001 for the short prompt). Independent of the prompt, GPT-4's lowest exam performance was 78.9% (15/19), thereby always surpassing the "pass" threshold. Our qualitative analysis of incorrect answers, based on Bloom's taxonomy, showed that errors were primarily in the "remember" (29/68) and "understand" (23/68) cognitive levels; specific issues arose in recalling details, understanding conceptual relationships, and adhering to standardized guidelines.
Conclusions: GPT-4 demonstrated a remarkable success rate when confronted with psychosomatic medicine multiple-choice exam questions, aligning with previous findings. When evaluated through Bloom's taxonomy, our data revealed that GPT-4 occasionally ignored specific facts (remember), provided illogical reasoning (understand), or failed to apply concepts to a new situation (apply). These errors, which were confidently presented, could be attributed to inherent model biases and the tendency to generate outputs that maximize likelihood.
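The proportions quoted in the Results can be recomputed from the reported counts. The snippet below is only an illustrative check, not part of the study's analysis; the dictionary labels and the `percent` helper are names introduced here for illustration, and the counts are copied from the abstract.

```python
# Counts copied from the abstract as (count, total) pairs.
# The labels are illustrative names, not terms defined by the authors.
reported = {
    "detailed prompt": (284, 307),
    "short prompt": (278, 307),
    "lowest single exam": (15, 19),
    "errors at 'remember' level": (29, 68),
    "errors at 'understand' level": (23, 68),
}


def percent(count, total):
    """Return count/total as a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


rates = {label: percent(count, total) for label, (count, total) in reported.items()}

for label, rate in rates.items():
    print(f"{label}: {rate}%")
```

To one decimal place the two prompt success rates come out at 92.5% and 90.6%, which round to the 93% and 91% stated in the abstract.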
Pages: 13