CodeScore: Evaluating Code Generation by Learning Code Execution

Cited by: 0

Authors:

Dong, Yihong ^{[1,2]}

Ding, Jiazheng ^{[1,2]}

Jiang, Xue ^{[1,2]}

Li, Ge ^{[1,2]}

Li, Zhuo ^{[1,2]}

Jin, Zhi ^{[1,2]}

Affiliations:

[1] Peking Univ, Key Lab High Confidence Software Technol, Minist Educ, Beijing, Peoples R China

[2] Peking Univ, Sch Comp Sci, Beijing, Peoples R China

Source:

ACM TRANSACTIONS ON SOFTWARE ENGINEERING AND METHODOLOGY | 2025 / Vol. 34 / Issue 03

Funding:

National Natural Science Foundation of China;

Keywords:

Code Evaluation; Code Pre-trained Language Model; Code Generation;

DOI:

10.1145/3695991

Chinese Library Classification (CLC) number:

TP31 [Computer Software];

Discipline classification code:

081202 ; 0835 ;

Abstract:

A proper code evaluation metric (CEM) profoundly impacts the evolution of code generation, which is an important research field in NLP and software engineering. Prevailing match-based CEMs (e.g., BLEU, Accuracy, and CodeBLEU) suffer from two significant drawbacks. (1) They primarily measure the surface differences between codes without considering their functional equivalence. However, functional equivalence is pivotal in evaluating the effectiveness of code generation, as different codes can perform identical operations. (2) They are predominantly designed for the Ref-only input format. However, code evaluation necessitates versatility in input formats. Aside from Ref-only, there are NL-only and Ref and NL formats, which existing match-based CEMs cannot effectively accommodate. In this article, we propose CodeScore, a large language model (LLM)-based CEM, which estimates the functional correctness of generated code on three input types. To acquire CodeScore, we present UniCE, a unified code generation learning framework, for LLMs to learn code execution (i.e., learning PassRatio and Executability of generated code) with unified input. Extensive experimental results on multiple code evaluation datasets demonstrate that CodeScore improves correlation with functional correctness by up to 58.87% in absolute terms compared to other CEMs, achieves state-of-the-art performance, and effectively handles three input formats.
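The two execution signals named in the abstract, PassRatio and Executability, can be illustrated with a small sketch. The abstract does not define them, so the definitions here are assumptions: PassRatio is taken as the fraction of test cases a generated program passes, and Executability as a 0/1 flag for whether the program compiles and runs without raising. The helper names (`pass_ratio`, `executability`, `token_overlap`) and the sample snippets are hypothetical. The Jaccard token overlap is only a crude stand-in for a surface-level comparison, not BLEU or CodeBLEU, and none of this is the learned CodeScore model itself.

```python
from typing import Any, Callable, List, Tuple


def pass_ratio(func: Callable[..., Any], tests: List[Tuple[tuple, Any]]) -> float:
    """Assumed definition: fraction of test cases whose output equals the expected value."""
    if not tests:
        raise ValueError("at least one test case is required")
    passed = 0
    for args, expected in tests:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a test case that raises counts as a failure
    return passed / len(tests)


def executability(source: str) -> int:
    """Assumed definition: 1 if the source compiles and runs without raising, else 0.

    exec() on untrusted code is unsafe; a real evaluator would use a sandbox.
    """
    try:
        exec(compile(source, "<generated>", "exec"), {})
        return 1
    except Exception:
        return 0


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of whitespace-separated tokens (crude surface similarity)."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)


def load_function(source: str, name: str) -> Callable[..., Any]:
    """Execute the source in a fresh namespace and return the named function."""
    namespace: dict = {}
    exec(source, namespace)
    return namespace[name]


# Hypothetical reference and two hypothetical generated candidates.
reference = "def add(a, b):\n    return a + b"
equivalent = "def add(x, y):\n    total = sum([x, y])\n    return total"
buggy = "def add(x, y):\n    return x - y"
broken = "def add(x, y) return"

tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]

equivalent_ratio = pass_ratio(load_function(equivalent, "add"), tests)
buggy_ratio = pass_ratio(load_function(buggy, "add"), tests)
surface_similarity = token_overlap(reference, equivalent)

print("equivalent PassRatio:", equivalent_ratio)
print("buggy PassRatio:", buggy_ratio)
print("surface overlap with reference:", surface_similarity)
print("Executability (equivalent, broken):", executability(equivalent), executability(broken))
```

Under these assumptions, the functionally equivalent candidate passes every test even though it shares few tokens with the reference, which is the gap between surface matching and functional correctness that the abstract describes.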

References

Pages: 22

68 entries in total

