Large language models and automated essay scoring of English language learner writing: Insights into validity and reliability

Cited by: 0

Authors:

Pack A. [1]

Barrett A. [2]

Escalante J. [1]

Affiliations:

[1] Faculty of Education and Social Work, Brigham Young University-Hawaii, 55-220 Kulanui Street Bldg 5, Laie, 96762-1293, HI

[2] College of Education, Florida State University, Stone Building, 114 West Call Street, Tallahassee, 32306-2400, FL

Source:

Computers and Education: Artificial Intelligence | 2024 / Vol. 6

Keywords:

Artificial intelligence; Automatic essay scoring; Automatic writing evaluation; ChatGPT; Generative AI; Large language model

DOI:

10.1016/j.caeai.2024.100234

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology]

Discipline classification code:

0812

Abstract:

Advancements in generative AI, such as large language models (LLMs), may serve as a potential solution to the burdensome task of essay grading often faced by language education teachers. Yet the validity and reliability of leveraging LLMs for automatic essay scoring (AES) in language education are not well understood. To address this, we evaluated the cross-sectional and longitudinal validity and reliability of four prominent LLMs (Google's PaLM 2, Anthropic's Claude 2, and OpenAI's GPT-3.5 and GPT-4) for the AES of English language learners' writing. A total of 119 essays taken from an English language placement test were assessed twice by each LLM, on two separate occasions, as well as by a pair of human raters. GPT-4 performed the best, demonstrating excellent intrarater reliability and good validity. All models, with the exception of GPT-3.5, improved over time in their intrarater reliability. The interrater reliability of GPT-3.5 and GPT-4, however, decreased slightly over time. These findings indicate that some models perform better than others in AES and that all models are subject to fluctuations in their performance. We discuss potential reasons for such variability and offer suggestions for prospective avenues of research. © 2024 The Authors

