Large language models and automated essay scoring of English language learner writing: Insights into validity and reliability

Cited by: 0
Authors
Pack A. [1]
Barrett A. [2]
Escalante J. [1]
Affiliations
[1] Faculty of Education and Social Work, Brigham Young University-Hawaii, 55-220 Kulanui Street Bldg 5, Laie, 96762-1293, HI
[2] College of Education, Florida State University, Stone Building, 114 West Call Street, Tallahassee, 32306-2400, FL
Source
Computers and Education: Artificial Intelligence | 2024 / Volume 6
Keywords
Artificial intelligence; Automatic essay scoring; Automatic writing evaluation; ChatGPT; Generative AI; Large language model
DOI
10.1016/j.caeai.2024.100234
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Advancements in generative AI, such as large language models (LLMs), may serve as a potential solution to the burdensome task of essay grading often faced by language education teachers. Yet the validity and reliability of leveraging LLMs for automatic essay scoring (AES) in language education are not well understood. To address this, we evaluated the cross-sectional and longitudinal validity and reliability of four prominent LLMs (Google's PaLM 2, Anthropic's Claude 2, and OpenAI's GPT-3.5 and GPT-4) for the AES of English language learners' writing. A total of 119 essays taken from an English language placement test were assessed twice by each LLM, on two separate occasions, as well as by a pair of human raters. GPT-4 performed the best, demonstrating excellent intrarater reliability and good validity. All models, with the exception of GPT-3.5, improved over time in their intrarater reliability. The interrater reliability of GPT-3.5 and GPT-4, however, decreased slightly over time. These findings indicate that some models perform better than others in AES and that all models are subject to fluctuations in their performance. We discuss potential reasons for such variability and offer suggestions for prospective avenues of research. © 2024 The Authors
Related papers
48 in total
[1]  
Attali Y., Validity and reliability of automated essay scoring, Handbook of automated essay evaluation: Current applications and new directions, pp. 181-198, (2013)
[2]  
Bahroun Z., Anane C., Ahmed V., Zacca A., Transforming education: A comprehensive review of generative artificial intelligence in educational settings through bibliometric and content analysis, Sustainability, 15, 12983, (2023)
[3]  
Baker R.S., Hawn A., Algorithmic bias in education, (2021)
[4]  
Bathaee Y., The artificial intelligence black box and the failure of intent and causation, Harvard Journal of Law and Technology, 31, 2, pp. 890-938, (2018)
[5]  
Bland J.M., Altman D.G., Measuring agreement in method comparison studies, Statistical Methods in Medical Research, 8, 2, pp. 135-160, (1999)
[6]  
Bogen M., All the ways hiring algorithms can introduce bias, Harvard Business Review, (2019)
[7]  
Bridgeman B., Human ratings and automated essay evaluation, Handbook of automated essay evaluation: Current applications and new directions, pp. 221-232, (2013)
[8]  
Bridgeman B., Trapani C., Attali Y., Comparison of human and machine scoring of essays: Differences by gender, ethnicity, and country, Applied Measurement in Education, 25, 1, pp. 27-40, (2012)
[9]  
Burstein J., Chodorow M., Automated essay scoring for nonnative English speakers, Computer mediated language assessment and evaluation in natural language processing, pp. 68-75, (1999)
[10]  
Carlson M., Pack A., Escalante J., Utilizing OpenAI's GPT-4 for written feedback, TESOL Journal, 759, (2023)