Evaluating Language Models for Generating and Judging Programming Feedback

Cited by: 0
Authors
Koutcheme, Charles [1 ]
Dainese, Nicola [1 ]
Sarsa, Sami [2 ]
Hellas, Arto [1 ]
Leinonen, Juho [1 ]
Ashraf, Syed [1 ]
Denny, Paul [3 ]
Affiliations
[1] Aalto Univ, Espoo, Finland
[2] Univ Jyvaskyla, Jyvaskyla, Finland
[3] Univ Auckland, Auckland, New Zealand
Source
PROCEEDINGS OF THE 56TH ACM TECHNICAL SYMPOSIUM ON COMPUTER SCIENCE EDUCATION, SIGCSE TS 2025, VOL 2 | 2025
Keywords
open source; large language models; generative AI; automatic feedback; automatic evaluation; programming feedback; LLM-as-a-judge;
DOI
Not available
Chinese Library Classification
TP39 [Computer Applications]
Subject Classification Codes
081203; 0835
Abstract
The emergence of large language models (LLMs) has transformed research and practice across a wide range of domains. Within the computing education research (CER) domain, LLMs have garnered significant attention, particularly in the context of learning programming. Much of the work on LLMs in CER, however, has focused on applying and evaluating proprietary models. In this article, we evaluate the efficiency of open-source LLMs in generating high-quality feedback for programming assignments and judging the quality of programming feedback, contrasting the results with proprietary models. Our evaluations on a dataset of students' submissions to introductory Python programming exercises suggest that state-of-the-art open-source LLMs are nearly on par with proprietary models in both generating and assessing programming feedback. Additionally, we demonstrate the efficiency of smaller LLMs in these tasks and highlight the wide range of LLMs accessible, even for free, to educators and practitioners.
Pages
624-630
Page count
7