AI in Education: An Analysis of Large Language Models for Twi Automatic Short Answer Grading

Cited: 0
Authors
Agyemang, Alex [1 ]
Schlippe, Tim [1 ]
Affiliations
[1] IU International University of Applied Sciences, Bad Honnef, Germany
Source
ARTIFICIAL INTELLIGENCE RESEARCH, SACAIR 2024 | 2025, Vol. 2326
Keywords
AI in Education; Automatic Short Answer Grading; Natural Language Processing; Africa; Large Language Models; LLMs
DOI
10.1007/978-3-031-78255-8_7
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Automatic short answer grading can significantly enhance the speed and fairness of grading, making it particularly valuable in regions with a shortage of teachers, such as Africa [1]. However, building automatic short answer grading systems for most African languages is very challenging due to the limited availability of natural language processing corpora. Furthermore, only experts can handle the complex algorithms required for training and fine-tuning traditional automatic short answer grading systems. Since state-of-the-art large language models have the potential to address these problems through their growing capabilities and their ease of use via prompting, particularly in zero-shot and few-shot learning, we investigated their performance for grading student answers in the African language Twi. To address the absence of a Twi corpus, we translated and validated the University of North Texas benchmark corpus [2], creating the first Twi automatic short answer grading corpus. On this corpus, we evaluated the large language models GPT-4o [3], Claude 3 Sonnet [4], and LLaMA 3 [5] as well as, for comparison, two more traditional approaches: a fine-tuned AfroLM model and a cross-lingual M-BERT approach. Among the individual models, the fine-tuned AfroLM performed best with a mean absolute error of 0.73 points out of 5 points, followed by the cross-lingual M-BERT at 0.79 points and Claude 3 Sonnet at 1.00 points. However, combining the AfroLM and M-BERT outputs achieved the lowest mean absolute error of 0.64 points, which is below the human grader variance of 0.75 points in the original corpus [6]. Combining the outputs of the large language models GPT-4o, Claude 3 Sonnet, and LLaMA 3, obtained through few-shot learning, yielded a mean absolute error of 1.10 points.
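To make the setup above concrete, here is a minimal sketch in Python of the kind of few-shot grading prompt and evaluation the abstract describes: assembling a prompt from a few worked question/reference/student/score examples, computing the mean absolute error (MAE) in grade points, and combining two graders' scores. The prompt wording, the simple-averaging combination rule, and all example questions, answers, and scores are placeholders assumed for illustration; they are not taken from the paper or the Twi corpus, and the study's actual prompts and combination method are not specified in the abstract.

# Minimal sketch, assuming a 0-5 scoring scale as reported in the abstract.
from statistics import mean

def build_few_shot_prompt(question, reference, student, examples):
    """Assemble a few-shot grading prompt: worked examples first, then the item to grade."""
    lines = ["You are a grader. Compare the student answer with the reference answer",
             "and give a score between 0 and 5.", ""]
    for ex in examples + [{"question": question, "reference": reference,
                           "student": student, "score": ""}]:
        lines += [f"Question: {ex['question']}",
                  f"Reference answer: {ex['reference']}",
                  f"Student answer: {ex['student']}",
                  f"Score: {ex['score']}", ""]
    return "\n".join(lines).rstrip()

def mean_absolute_error(predicted, gold):
    """MAE in grade points, the metric reported in the abstract."""
    return mean(abs(p - g) for p, g in zip(predicted, gold))

def combine_scores(scores_a, scores_b):
    """Combine two graders' scores by simple averaging (an assumed rule)."""
    return [(a + b) / 2 for a, b in zip(scores_a, scores_b)]

if __name__ == "__main__":
    # Placeholder few-shot example; real prompts would contain Twi text from the corpus.
    examples = [{"question": "<question in Twi>", "reference": "<reference answer in Twi>",
                 "student": "<student answer in Twi>", "score": 4}]
    print(build_few_shot_prompt("<new question>", "<reference answer>",
                                "<student answer>", examples))

    # Placeholder scores standing in for two graders' predictions on four answers.
    gold = [5.0, 3.0, 4.0, 2.0]
    model_a = [4.5, 2.0, 4.0, 3.0]
    model_b = [5.0, 3.5, 3.0, 2.5]
    print("Model A MAE:", mean_absolute_error(model_a, gold))
    print("Model B MAE:", mean_absolute_error(model_b, gold))
    print("Combined MAE:", mean_absolute_error(combine_scores(model_a, model_b), gold))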
Pages: 107-123
Number of Pages: 17
References (58 in total)
[1]
Aboagye F., 2021, Text-to-Speech for Ghanaian Language (Akuapem Twi) on an Embedded System, Capstone Project
[2]  
Adjeisah M., 2020, Acad. J. Sci. Res., V8, P371
[3]
Agyei E., Zhang X., Bannerman S., Quaye A.B., Yussi S.B., Agbesi V.K., 2024, Low resource Twi-English parallel corpus for machine translation in multiple domains (Twi-2-ENG), Discover Computing, V27, N01
[4]  
Alabi J.O., 2020, 12 C LANG RES EV LRE
[5]  
Alabi J.O., 2019, INT C LANG RES EV
[6]
alnresources.wordpress, The African Linguists Network Blog: Language Guide
[7]  
amesall.rutgers, 2022, Akan (Twi) at Rutgers
[8]  
[Anonymous], 2016, Sustainable Development Goals Knowledge Platform
[9]  
[Anonymous], 2024, The Claude 3 Model Family: Opus, Sonnet, Haiku
[10]  
Azunre P., 2021, 2 AFRICANLP WORKSH P