Evaluating Language Models for Generating and Judging Programming Feedback

Cited by: 0
Authors
Koutcheme, Charles [1 ]
Dainese, Nicola [1 ]
Sarsa, Sami [2 ]
Hellas, Arto [1 ]
Leinonen, Juho [1 ]
Ashraf, Syed [1 ]
Denny, Paul [3 ]
Affiliations
[1] Aalto Univ, Espoo, Finland
[2] Univ Jyvaskyla, Jyvaskyla, Finland
[3] Univ Auckland, Auckland, New Zealand
Source
PROCEEDINGS OF THE 56TH ACM TECHNICAL SYMPOSIUM ON COMPUTER SCIENCE EDUCATION, SIGCSE TS 2025, VOL 2 | 2025
Keywords
open source; large language models; generative AI; automatic feedback; automatic evaluation; programming feedback; LLM-as-a-judge;
DOI
Not available
Chinese Library Classification
TP39 [Computer Applications]
Subject Classification Codes
081203; 0835
Abstract
The emergence of large language models (LLMs) has transformed research and practice across a wide range of domains. Within the computing education research (CER) domain, LLMs have garnered significant attention, particularly in the context of learning programming. Much of the work on LLMs in CER, however, has focused on applying and evaluating proprietary models. In this article, we evaluate the efficiency of open-source LLMs in generating high-quality feedback for programming assignments and judging the quality of programming feedback, contrasting the results with proprietary models. Our evaluations on a dataset of students' submissions to introductory Python programming exercises suggest that state-of-the-art open-source LLMs are nearly on par with proprietary models in both generating and assessing programming feedback. Additionally, we demonstrate the efficiency of smaller LLMs in these tasks and highlight the wide range of LLMs accessible, even for free, to educators and practitioners.
Pages
624-630
Page count
7