CodeScore: Evaluating Code Generation by Learning Code Execution

Times Cited: 1
Authors
Dong, Yihong [1 ,2 ]
Ding, Jiazheng [1 ,2 ]
Jiang, Xue [1 ,2 ]
Li, Ge [1 ,2 ]
Li, Zhuo [1 ,2 ]
Jin, Zhi [1 ,2 ]
Affiliations
[1] Peking Univ, Key Lab High Confidence Software Technol, Minist Educ, Beijing, Peoples R China
[2] Peking Univ, Sch Comp Sci, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Code Evaluation; Code Pre-trained Language Model; Code Generation;
DOI
10.1145/3695991
Chinese Library Classification
TP31 [Computer Software];
Discipline Code
081202; 0835;
Abstract
A proper code evaluation metric (CEM) profoundly impacts the evolution of code generation, an important research field in both NLP and software engineering. Prevailing match-based CEMs (e.g., BLEU, Accuracy, and CodeBLEU) suffer from two significant drawbacks. (1) They primarily measure surface-level differences between code snippets without considering functional equivalence, yet functional equivalence is pivotal in evaluating code generation, since different code can perform identical operations. (2) They are designed almost exclusively for the Ref-only input format, whereas code evaluation requires versatility across input formats: besides Ref-only, there are NL-only and Ref and NL formats, which existing match-based CEMs cannot effectively accommodate. In this article, we propose CodeScore, a large language model (LLM)-based CEM that estimates the functional correctness of generated code for all three input types. To obtain CodeScore, we present UniCE, a unified code generation learning framework, for LLMs to learn code execution (i.e., learning the PassRatio and Executability of generated code) with unified input. Extensive experimental results on multiple code evaluation datasets demonstrate that CodeScore improves correlation with functional correctness by up to 58.87% (absolute) compared with other CEMs, achieves state-of-the-art performance, and effectively handles all three input formats.
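The abstract names two execution-based learning targets, PassRatio and Executability. Below is a minimal Python sketch (not the authors' implementation) of how such quantities can be computed for a generated snippet against a small set of test cases; the helper names (run_case, pass_ratio, executability) and the test-case format are illustrative assumptions, not part of the paper.

from typing import List, Tuple

def run_case(code: str, call: str, expected: object) -> bool:
    """Run one illustrative test case: execute `code`, evaluate `call`, compare to `expected`."""
    namespace: dict = {}
    try:
        exec(code, namespace)          # define the generated function(s)
        return eval(call, namespace) == expected
    except Exception:
        return False                   # any runtime error counts as a failed case

def executability(code: str) -> float:
    """1.0 if the generated code runs without raising an error, else 0.0."""
    try:
        exec(code, {})
        return 1.0
    except Exception:
        return 0.0

def pass_ratio(code: str, cases: List[Tuple[str, object]]) -> float:
    """Fraction of test cases passed -- the execution signal CodeScore is trained to estimate."""
    if not cases:
        return 0.0
    return sum(run_case(code, call, expected) for call, expected in cases) / len(cases)

# Functionally equivalent to a reference such as `def add(a, b): return a + b`,
# but textually different, so a match-based metric would under-score it,
# while the execution-based targets give full credit.
generated = "def add(x, y):\n    return sum([x, y])\n"
tests = [("add(1, 2)", 3), ("add(-1, 1)", 0)]
print(executability(generated), pass_ratio(generated, tests))   # -> 1.0 1.0

As the abstract describes, CodeScore is trained to predict such execution-derived signals, so that at evaluation time functional correctness can be estimated from the model's output without actually running the generated code.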
Pages: 22