Beyond Accuracy and Robustness Metrics for Large Language Models for Code

Cited by: 2
Author
Rodriguez-Cardenas, Daniel [1 ]
Affiliation
[1] William & Mary, Williamsburg, VA 23185 USA
Source
2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings (ICSE-Companion 2024), 2024
Keywords
deep learning; code generation; interpretability; transformers
DOI
10.1145/3639478.3639792
Chinese Library Classification
TP18 [Theory of Artificial Intelligence]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
In recent years, Large Language Models for code (LLMc) have transformed the landscape of software engineering (SE), demonstrating significant efficacy in tasks such as code completion, summarization, review, tracing, translation, test case generation, clone detection, and bug fixing. Notably, GitHub Copilot [31] and Google's CodeBot [21] exemplify how LLMc contribute to substantial time and effort savings in software development. However, despite their widespread use, there is a growing need to assess LLMc thoroughly: current evaluation processes rely heavily on accuracy and robustness metrics, and there is no consensus on which additional factors influence code generation. This gap hinders a holistic understanding of LLMc performance, affecting interpretability, efficiency, bias, fairness, and robustness. Challenges in benchmarking and data maintenance compound the issue, underscoring the need for a comprehensive evaluation approach. To address these issues, this dissertation proposes a benchmarking infrastructure, named HolBench, aimed at closing the gaps in evaluating LLMc quality. The goal is to standardize testing scenarios, facilitate meaningful comparisons across LLMc, and provide multi-metric measurements beyond a sole focus on accuracy. This approach aims to lower the costs of advancing LLMc research and to make these models more reliable for adoption in academia and industry.
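To make the multi-metric idea concrete, below is a minimal Python sketch of the kind of scoring such an infrastructure could standardize: functional correctness via the unbiased pass@k estimator introduced with HumanEval (Chen et al. [5]), paired with a simple robustness score under semantics-preserving prompt perturbations. The function names and the robustness scheme are illustrative assumptions, not HolBench's actual API.

```python
# A minimal sketch (not HolBench's actual API) of multi-metric scoring:
# functional correctness via the unbiased pass@k estimator from
# Chen et al. [5], plus a hypothetical robustness score.
from math import prod


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    passes the tests, given n generations of which c are correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - prod(1.0 - k / i for i in range(n - c + 1, n + 1))


def robustness_score(base: list[bool], perturbed: list[bool]) -> float:
    """Hypothetical robustness metric: fraction of originally-correct
    generations that remain correct after a semantics-preserving
    prompt perturbation (e.g., renaming identifiers in the docstring)."""
    kept = sum(b and p for b, p in zip(base, perturbed))
    passed = sum(base)
    return kept / passed if passed else 0.0


if __name__ == "__main__":
    # 10 generations for one problem, 4 of which pass the unit tests.
    print(f"pass@1 = {pass_at_k(10, 4, 1):.2f}")  # 0.40
    print(f"pass@5 = {pass_at_k(10, 4, 5):.2f}")  # 0.98
    base = [True, True, False, True, False]
    pert = [True, False, False, True, False]
    print(f"robustness = {robustness_score(base, pert):.2f}")  # 0.67
```

A fuller harness would add further axes over the same generations (e.g., efficiency or fairness measurements); the point here is only that each metric consumes the same per-sample pass/fail records, so metrics beyond accuracy come at little extra evaluation cost.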
Pages: 159-161 (3 pages)
References (37 total)
[1] OpenAI. GPT-4 Technical Report. arXiv, 2023. DOI: 10.48550/arXiv.2303.08774
[2] Nguyen, Anh Tuan; Nguyen, Tien N. Graph-based Statistical Language Model for Code. 2015 IEEE/ACM 37th International Conference on Software Engineering (ICSE), Vol 1, 2015: 858-868.
[3] Austin, J. Program Synthesis with Large Language Models. arXiv, 2021. DOI: 10.48550/arXiv.2108.07732
[4] Cassano, F. MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation. arXiv, 2022. arXiv:2208.08227
[5] Chen, M. Evaluating Large Language Models Trained on Code. arXiv, 2021.
[6] Chen, Zimin; Kommrusch, Steve; Tufano, Michele; Pouchet, Louis-Noel; Poshyvanyk, Denys; Monperrus, Martin. SequenceR: Sequence-to-Sequence Learning for End-to-End Program Repair. IEEE Transactions on Software Engineering, 2021, 47(9): 1943-1959.
[7] Ciniselli, Matteo; Cooper, Nathan; Pascarella, Luca; Mastropaolo, Antonio; Aghajani, Emad; Poshyvanyk, Denys; Di Penta, Massimiliano; Bavota, Gabriele. An Empirical Study on the Usage of Transformer Models for Code Completion. IEEE Transactions on Software Engineering, 2022, 48(12): 4818-4837.
[8] Connor, Aidan; Harris, Aaron; Cooper, Nathan; Poshyvanyk, Denys. Can We Automatically Fix Bugs by Learning Edit Operations? 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER 2022), 2022: 782-792.
[9] Hendrycks, D. Measuring Coding Challenge Competence With APPS. arXiv, 2021. arXiv:2105.09938
[10] Hou, X. Y. Large Language Models for Software Engineering: A Systematic Literature Review. arXiv, 2024. DOI: 10.48550/arXiv.2308.10620