Leveraging Large Language Models for Automated Chinese Essay Scoring

Cited by: 1
Authors
Feng, Haiyue [1 ,2 ]
Du, Sixuan [2 ,3 ]
Zhu, Gaoxia [2 ]
Zou, Yan [5 ]
Poh Boon Phua [5 ]
Feng, Yuhong [1 ]
Zhong, Haoming [4 ]
Shen, Zhiqi [2 ]
Liu, Siyuan [2 ]
Affiliations
[1] Shenzhen Univ, Coll Comp Sci & Software Engn, Shenzhen, Peoples R China
[2] Nanyang Technol Univ, Coll Comp & Data Sci, Singapore, Singapore
[3] Hohai Univ, Business Sch, Nanjing, Jiangsu, Peoples R China
[4] Webank, Shenzhen, Peoples R China
[5] Woodlands Secondary Sch, Singapore, Singapore
Source
ARTIFICIAL INTELLIGENCE IN EDUCATION, PT I, AIED 2024 | 2024 / Vol. 14829
Keywords
Automated Chinese Essay Scoring; Large Language Models; GPT
DOI
10.1007/978-3-031-64302-6_32
Chinese Library Classification
TP18 [Theory of artificial intelligence]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Automated Essay Scoring (AES) plays a crucial role in providing immediate feedback, reducing educators' grading workload, and improving students' learning experiences. With their strong generalization capabilities, large language models (LLMs) offer a new perspective on AES. While previous research has primarily employed deep learning architectures and models such as BERT for feature extraction and scoring, the potential of LLMs for Chinese AES remains largely unexplored. In this paper, we explore the capabilities of well-established LLMs, e.g., the GPT series by OpenAI and Qwen-1.8B by Alibaba Cloud, for Chinese AES. We constructed a Chinese essay dataset with carefully developed rubrics and collected grades from human raters based on them. We then prompted GPT-4, fine-tuned GPT-3.5, and Qwen to produce grades, adopting different strategies for prompt generation and model fine-tuning. Comparing the grades assigned by LLMs with those of human raters suggests that the prompt-generation strategy has a marked impact on their agreement, and that model fine-tuning further improves the consistency between LLM and human scores. Comparative experiments demonstrate that fine-tuned GPT-3.5 and Qwen outperform BERT in Quadratic Weighted Kappa (QWK). These results highlight the substantial potential of LLMs for Chinese AES and pave the way for further research on integrating LLMs into Chinese AES with varied prompt-generation and model fine-tuning strategies.
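As an illustration of the rubric-conditioned prompting setup the abstract describes, the sketch below shows how an essay and a grading rubric might be sent to GPT-4 via the OpenAI Python client. The rubric text, score range, and prompt wording are hypothetical placeholders for illustration only, not the prompts used in the paper.

```python
# A minimal sketch of rubric-based essay scoring with an LLM, assuming the
# OpenAI Python client (openai>=1.0). The rubric and prompt wording below are
# hypothetical, not the prompts used in the paper.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = """Score the essay from 1 (poor) to 5 (excellent) on:
- Content: relevance to the topic and depth of ideas
- Organization: logical structure and coherence
- Language: grammar, vocabulary, and fluency"""

def score_essay(essay: str) -> str:
    """Ask the model for a single overall grade for a Chinese essay."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic output for more reproducible grading
        messages=[
            {"role": "system",
             "content": "You are an experienced Chinese essay grader. "
                        f"Grade strictly by this rubric:\n{RUBRIC}\n"
                        "Reply with the overall score only."},
            {"role": "user", "content": essay},
        ],
    )
    return response.choices[0].message.content.strip()
```

The same loop would apply unchanged to a fine-tuned GPT-3.5 model by swapping in its model identifier; a locally hosted Qwen-1.8B would need its own inference client.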
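The abstract reports agreement between LLM and human grades as QWK. A minimal sketch of that comparison using scikit-learn's quadratically weighted Cohen's kappa follows; the grade vectors are made-up examples, not the paper's data.

```python
# Quadratic Weighted Kappa (QWK) between human and LLM grades, computed with
# scikit-learn. The grade vectors are made-up examples, not the paper's data.
from sklearn.metrics import cohen_kappa_score

human_grades = [3, 4, 2, 5, 4, 3, 1, 4]
llm_grades   = [3, 4, 3, 5, 3, 3, 2, 4]

# weights="quadratic" penalizes large disagreements more than adjacent ones,
# which suits ordinal essay grades.
qwk = cohen_kappa_score(human_grades, llm_grades, weights="quadratic")
print(f"QWK = {qwk:.3f}")  # 1.0 is perfect agreement; 0 is chance level
```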
Pages: 454-467
Page count: 14