Benchmarking Causal Study to Interpret Large Language Models for Source Code

Cited by: 6
Authors
Rodriguez-Cardenas, Daniel [1 ]
Palacio, David N. [1 ]
Khati, Dipin [1 ]
Burke, Henry [1 ]
Poshyvanyk, Denys [1 ]
Affiliation
[1] William & Mary, Dept Comp Sci, Williamsburg, VA 23185 USA
Source
2023 IEEE INTERNATIONAL CONFERENCE ON SOFTWARE MAINTENANCE AND EVOLUTION (ICSME) | 2023
Keywords
Software Engineering; Testbeds; Large Language Models; dl4se; Interpretability;
DOI
10.1109/ICSME58846.2023.00040
CLC number
TP31 [Computer Software];
Discipline codes
081202 ; 0835 ;
Abstract
One of the most common solutions adopted by software researchers to address code generation is to train Large Language Models (LLMs) on massive amounts of source code. LLMs are rooted in the concept of emergent capabilities, in which machines statistically learn complex patterns from code data. Although a number of studies have evaluated LLMs on popular accuracy metrics (e.g., BLEU, CodeBLEU), previous research has largely overlooked the role of causal inference as a fundamental component of interpreting LLMs' performance. Existing benchmarks and datasets are meant to highlight the difference between the expected and the generated outcome, but they do not account for confounding variables (e.g., lines of code, number of tokens, prompt size) that equally influence the accuracy metrics. When dealing with generative software tasks performed by LLMs, no benchmark is available that tells researchers how to quantify either the causal effect of SE-based treatments or the correlation of confounders with the model's performance. In an effort to bring statistical rigor to the evaluation of LLMs, this paper introduces a benchmarking strategy named Galeras, comprising curated testbeds for three SE tasks (i.e., code completion, code summarization, and commit generation), to aid the interpretation of LLMs' performance. We illustrate the insights of our benchmarking strategy by conducting a case study on the performance of ChatGPT under distinct prompt engineering methods. The results of the case study demonstrate a positive causal influence of prompt semantics on ChatGPT's generative performance, with an average treatment effect of approximately 3%. Moreover, we found that confounders such as prompt size are highly correlated with the accuracy metrics (approximately 0.412). Ultimately, our case study showcases how causal inference evaluations can, in practice, reduce confounding bias. By reducing this bias, we offer an interpretable account of the accuracy metric under analysis.
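The abstract's central point, that a naive comparison of accuracy metrics is biased by confounders such as prompt size, and that an adjusted average treatment effect (ATE) corrects for this, can be illustrated with a small synthetic sketch. The data, effect sizes, and the simple linear covariate-adjustment estimator below are illustrative assumptions only; they are not the Galeras pipeline or the paper's actual estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical confounder: prompt size in tokens. It influences both
# whether the treatment is applied and the accuracy metric itself.
prompt_size = rng.normal(100.0, 20.0, n)

# Treatment: a semantics-rich prompt, more likely for larger prompts,
# which is exactly what creates the confounding.
treated = (prompt_size + rng.normal(0.0, 20.0, n) > 100.0).astype(float)

# Outcome: an accuracy-style metric with a true treatment effect of 0.03
# (mirroring the ~3% ATE reported in the abstract) plus a confounder term.
metric = (0.50 + 0.03 * treated + 0.001 * prompt_size
          + rng.normal(0.0, 0.02, n))

# Naive comparison of group means: biased upward, because treated
# examples also tend to have larger prompts.
naive = metric[treated == 1].mean() - metric[treated == 0].mean()

# Covariate adjustment: ordinary least squares of the metric on the
# treatment indicator and the confounder; the treatment coefficient is
# the adjusted ATE.
X = np.column_stack([np.ones(n), treated, prompt_size])
coef, *_ = np.linalg.lstsq(X, metric, rcond=None)
ate = coef[1]

# Correlation of the confounder with the metric, analogous to the
# ~0.412 prompt-size correlation reported in the abstract.
corr = np.corrcoef(prompt_size, metric)[0, 1]

print(f"naive difference: {naive:.4f}")
print(f"adjusted ATE:     {ate:.4f}")
print(f"confounder corr:  {corr:.3f}")
```

On this synthetic data the naive difference overstates the true 0.03 effect, while the regression-adjusted estimate recovers it; that gap is the confounding bias the benchmarking strategy is designed to expose.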
Pages: 329-334 (6 pages)