Large language models and automated essay scoring of English language learner writing: Insights into validity and reliability

Cited by: 0
Authors
Pack A. [1 ]
Barrett A. [2 ]
Escalante J. [1 ]
Affiliations
[1] Faculty of Education and Social Work, Brigham Young University-Hawaii, 55-220 Kulanui Street Bldg 5, Laie, HI 96762-1293
[2] College of Education, Florida State University, Stone Building, 114 West Call Street, Tallahassee, FL 32306-2400
Source
Computers and Education: Artificial Intelligence | 2024 / Vol. 6
Keywords
Artificial intelligence; Automatic essay scoring; Automatic writing evaluation; ChatGPT; Generative AI; Large language model
DOI
10.1016/j.caeai.2024.100234
Chinese Library Classification (CLC) Number
TP3 [Computing technology; computer technology]
Subject Classification Code
0812
Abstract
Advancements in generative AI, such as large language models (LLMs), may offer a solution to the burdensome task of essay grading that language education teachers often face. Yet the validity and reliability of using LLMs for automatic essay scoring (AES) in language education are not well understood. To address this, we evaluated the cross-sectional and longitudinal validity and reliability of four prominent LLMs (Google's PaLM 2, Anthropic's Claude 2, and OpenAI's GPT-3.5 and GPT-4) for the AES of English language learners' writing. A total of 119 essays taken from an English language placement test were assessed by each LLM on two separate occasions, as well as by a pair of human raters. GPT-4 performed the best, demonstrating excellent intrarater reliability and good validity. All models, with the exception of GPT-3.5, improved in their intrarater reliability over time. The interrater reliability of GPT-3.5 and GPT-4, however, decreased slightly over time. These findings indicate that some models perform better than others in AES and that all models are subject to fluctuations in performance. We discuss potential reasons for such variability and offer suggestions for prospective avenues of research. © 2024 The Authors
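For readers who want a concrete sense of the reliability measures the abstract refers to, the Python sketch below shows one common way to estimate intrarater and interrater agreement for ordinal essay scores. It is a minimal illustration under stated assumptions, not the paper's actual procedure: it assumes a quadratic weighted Cohen's kappa (computed with scikit-learn), and every score value in it is an invented placeholder rather than data from the study.

    # Minimal sketch: estimating intrarater and interrater reliability for
    # ordinal essay scores. Assumes quadratic weighted Cohen's kappa; the
    # study itself may use different coefficients. All scores below are
    # made-up placeholders, not data from the paper.
    from sklearn.metrics import cohen_kappa_score

    # Hypothetical scores (0-6 scale) assigned to the same eight essays.
    llm_run1 = [4, 3, 5, 2, 4, 3, 5, 4]   # LLM, first scoring occasion
    llm_run2 = [4, 3, 4, 2, 4, 3, 5, 5]   # LLM, second scoring occasion
    human    = [4, 2, 5, 2, 3, 3, 5, 4]   # human raters' scores

    # Intrarater reliability: agreement of the same rater (the LLM)
    # with itself across the two scoring occasions.
    intra = cohen_kappa_score(llm_run1, llm_run2, weights="quadratic")

    # Interrater reliability (one simple operationalization): agreement
    # between the LLM and the human raters on the same essays.
    inter = cohen_kappa_score(llm_run1, human, weights="quadratic")

    print(f"intrarater QWK: {intra:.2f}")
    print(f"interrater QWK: {inter:.2f}")

A quadratic weighting penalizes large disagreements more heavily than adjacent ones, which is why it is a common choice for ordinal rating scales; other coefficients for ordinal data would follow the same pattern of comparing paired score vectors.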