Automatic item generation in various STEM subjects using large language model prompting

Cited by: 0
Authors
Chan, Kuang Wen [1 ]
Ali, Farhan [2 ]
Park, Joonhyeong [2 ]
Sham, Kah Shen Brandon [1 ]
Tan, Erdalyn Yeh Thong [1 ]
Chong, Francis Woon Chien [1 ]
Qian, Kun [1 ]
Sze, Guan Kheng [1 ]
Affiliations
[1] National Institute of Education, Nanyang Technological University
Source
Computers and Education: Artificial Intelligence | 2025, Vol. 8
Keywords
Assessment; Automatic item generation; Generative artificial intelligence; Large language model; STEM education
DOI
10.1016/j.caeai.2024.100344
Abstract
Large language models (LLMs) that power chatbots such as ChatGPT have capabilities across numerous domains. Teachers and students have been increasingly using chatbots in science, technology, engineering, and mathematics (STEM) subjects in various ways, including for assessment purposes. However, there has been little systematic investigation into LLMs' capabilities and limitations in automatically generating items for STEM assessments, especially given that LLMs can hallucinate and may risk promoting misconceptions and hindering conceptual understanding. To address this, we systematically investigated LLMs' conceptual understanding and the quality of their working in generating question-and-answer pairs across STEM subjects. We applied prompt engineering to GPT-3.5 and GPT-4 with three approaches: standard prompting; standard prompting augmented with chain-of-thought prompting using worked examples with steps; and the chain-of-thought prompting combined with coding language. Question-and-answer pairs were generated at the pre-university level in three STEM subjects (chemistry, physics, and mathematics) and evaluated by subject-matter experts. Overall, LLMs generated quality questions under chain-of-thought prompting for both GPT-3.5 and GPT-4, and under chain-of-thought prompting with coding language for GPT-4. However, there were varying patterns in generating multistep answers, with differences between final-answer and intermediate-step accuracy. An interesting finding was that chain-of-thought prompting with coding language for GPT-4 significantly outperformed the other approaches in generating correct final answers while demonstrating only moderate accuracy in generating the full multistep answers correctly. In addition, through qualitative analysis we identified domain-specific prompting patterns across the three STEM subjects. We then discussed how our findings align with, contradict, and contribute to the current body of knowledge on automatic item generation using LLMs, and the implications for teachers using LLMs to generate STEM assessment items. © 2024 The Authors
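The three prompting conditions compared in the study can be illustrated with a minimal sketch using the OpenAI Python SDK. The prompt wordings, the topic, and the parameters below are hypothetical placeholders for illustration; they are not the authors' actual prompts or evaluation setup.

```
# Minimal sketch of the three prompting conditions (standard, chain-of-thought
# with a worked example, chain-of-thought with coding language). All prompt
# text here is an illustrative assumption, not reproduced from the paper.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

TOPIC = "ideal gas law"  # hypothetical pre-university topic

PROMPTS = {
    # 1) Standard prompting: a direct instruction only.
    "standard": (
        f"Generate one pre-university exam question on the {TOPIC}, "
        "then give the final answer."
    ),
    # 2) Chain-of-thought prompting: add a worked example with explicit steps.
    "chain_of_thought": (
        f"Generate one pre-university exam question on the {TOPIC}.\n"
        "Here is a worked example of the expected answer format:\n"
        "Step 1: State the relevant equation (PV = nRT).\n"
        "Step 2: Substitute the given values with units.\n"
        "Step 3: Solve and report the final answer.\n"
        "Now produce a new question and answer it step by step."
    ),
    # 3) Chain-of-thought with coding language: ask for the solution steps as
    # code, so each intermediate quantity is computed rather than stated.
    "chain_of_thought_code": (
        f"Generate one pre-university exam question on the {TOPIC}. "
        "Answer it by writing Python code in which each solution step is "
        "a commented line, then state the final numeric answer."
    ),
}

for name, prompt in PROMPTS.items():
    response = client.chat.completions.create(
        model="gpt-4",  # the study compared GPT-3.5 and GPT-4
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # assumption: deterministic output for comparison
    )
    print(f"--- {name} ---")
    print(response.choices[0].message.content)
```

In the study itself, the generated question-and-answer pairs were then judged by subject-matter experts, separately scoring the quality of the question, the final answer, and the intermediate steps.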