On the Evaluation of Neural Code Translation: Taxonomy and Benchmark

Cited by: 2
Authors
Jiao, Mingsheng [1]
Yu, Tingrui [1]
Li, Xuan [1]
Qiu, Guanjie [1]
Gu, Xiaodong [1]
Shen, Beijun [1]
Affiliations
[1] Shanghai Jiao Tong University, School of Software, Shanghai, People's Republic of China
Source
2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE) | 2023
Funding
National Natural Science Foundation of China
Keywords
Code Translation; Empirical Study; Benchmark; Evaluation
DOI
10.1109/ASE56229.2023.00114
CLC Classification
TP [Automation Technology, Computer Technology]
Subject Classification
0812
Abstract
In recent years, neural code translation has gained increasing attention. While most research focuses on improving model architectures and training processes, we observe that the evaluation process and benchmarks for code translation models are severely limited: they primarily treat source code as natural language and report a single holistic accuracy score, disregarding the full spectrum of model capabilities across translation types and complexity levels. In this paper, we present a comprehensive investigation of four state-of-the-art models and analyze in depth the advantages and limitations of three existing benchmarks. Based on the empirical results, we develop a taxonomy that categorizes code translation tasks into four primary types according to their complexity and knowledge dependence: token level (type 1), syntactic level (type 2), library level (type 3), and algorithm level (type 4). We then conduct a thorough analysis of how existing approaches perform across these four categories. Our findings indicate that while state-of-the-art code translation models excel at type-1 and type-2 translations, they struggle with knowledge-dependent ones (type 3 and type 4). Existing benchmarks are biased towards trivial translations, such as keyword mapping. To overcome these limitations, we construct G-TransEval, a new benchmark built by manually curating type-3 and type-4 translation pairs together with unit test cases. Results on the new benchmark suggest that G-TransEval exposes the capabilities of code translation models in a more comprehensive and fine-grained manner, and thus provides a more rigorous evaluation. Our study also yields insightful findings and suggestions for future research, such as building type-3 and type-4 training data and ensembling multiple pre-training approaches.
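To make the taxonomy and the test-based evaluation concrete, here is a minimal Python sketch that pairs a hypothetical Java snippet with a Python translation for each of the four types, and then checks one translation against unit tests. The pair contents, TAXONOMY_EXAMPLES, passes_unit_tests, and the gcd example are illustrative assumptions, not items drawn from the paper's actual benchmark.

# Hypothetical Java -> Python pairs illustrating the four-level taxonomy;
# the actual G-TransEval items are manually curated by the authors.
TAXONOMY_EXAMPLES = {
    # Type 1 (token level): one-to-one mapping of keywords/literals.
    "type-1": ("boolean flag = true;", "flag = True"),
    # Type 2 (syntactic level): restructuring a language construct.
    "type-2": (
        "for (int i = 0; i < n; i++) { total += i; }",
        "for i in range(n):\n    total += i",
    ),
    # Type 3 (library level): requires cross-library API knowledge.
    "type-3": ("Collections.sort(items);", "items.sort()"),
    # Type 4 (algorithm level): re-expressing a whole algorithmic routine.
    "type-4": (
        "int gcd(int a, int b) { return b == 0 ? a : gcd(b, a % b); }",
        "def gcd(a, b):\n    return a if b == 0 else gcd(b, a % b)",
    ),
}

def passes_unit_tests(candidate_src, entry_point, test_cases):
    """Check a translated snippet against unit tests (functional correctness),
    in the spirit of G-TransEval's test-based evaluation; the real harness
    is more involved and covers multiple target languages."""
    namespace = {}
    try:
        exec(candidate_src, namespace)   # define the translated function
        fn = namespace[entry_point]
        return all(fn(*args) == expected for args, expected in test_cases)
    except Exception:
        return False                     # any crash counts as a failure

if __name__ == "__main__":
    python_gcd = TAXONOMY_EXAMPLES["type-4"][1]
    print(passes_unit_tests(python_gcd, "gcd", [((12, 18), 6), ((7, 3), 1)]))  # True

Executing translations against test cases, as sketched above, measures functional equivalence rather than surface similarity to a reference, which is presumably why curated unit tests make type-3 and type-4 evaluation more rigorous than a single match-based accuracy score.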
Pages: 1529-1541
Number of pages: 13