Is the Cure Worse Than the Disease? Overfitting in Automated Program Repair

Cited by: 228

Authors:

Smith, Edward K. [1]

Barr, Earl T. [2]

Le Goues, Claire [3]

Brun, Yuriy [1]

Affiliations:

[1] Univ Massachusetts, Amherst, MA 01003 USA

[2] UCL, London, England

[3] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA

Source:

2015 10TH JOINT MEETING OF THE EUROPEAN SOFTWARE ENGINEERING CONFERENCE AND THE ACM SIGSOFT SYMPOSIUM ON THE FOUNDATIONS OF SOFTWARE ENGINEERING (ESEC/FSE 2015) PROCEEDINGS | 2015

Funding:

U.S. National Science Foundation;

Keywords:

automated program repair; empirical evaluation; independent evaluation; GenProg; TrpAutoRepair; INTROCLASS;

DOI:

10.1145/2786805.2786825

Chinese Library Classification (CLC) number:

TP31 [Computer Software];

Subject classification code:

081202 ; 0835 ;

Abstract:

Automated program repair has shown promise for reducing the significant manual effort debugging requires. This paper addresses a deficit of earlier evaluations of automated repair techniques caused by repairing programs and evaluating generated patches' correctness using the same set of tests. Since tests are an imperfect metric of program correctness, evaluations of this type do not discriminate between correct patches and patches that overfit the available tests and break untested but desired functionality. This paper evaluates two well-studied repair tools, GenProg and TrpAutoRepair, on a publicly available benchmark of 998 bugs, each with a human-written patch. By evaluating patches using tests independent from those used during repair, we find that the tools are unlikely to improve the proportion of independent tests passed, and that the quality of the patches is proportional to the coverage of the test suite used during repair. For programs that pass most tests, the tools are as likely to break tests as to fix them. However, novice developers also overfit, and automated repair performs no worse than these developers. In addition to overfitting, we measure the effects of test suite coverage, test suite provenance, and starting program quality, as well as the difference in quality between novice-developer-written and tool-generated patches when quality is assessed with a test suite independent from the one used for patch generation.

Pages: 532-543

Page count: 12