The impact of tangled code changes on defect prediction models

Cited by: 55
Authors
Herzig, Kim [1]
Just, Sascha [2]
Zeller, Andreas [3]
Affiliations
[1] Microsoft Res, Empir Software Engn Grp, 21 Stn Rd, Cambridge CB1 2DZ, England
[2] Univ Saarland, Software Engn Chair, Campus E1-1, D-66123 Saarbrucken, Germany
[3] Univ Saarland, Software Engn, Campus E1-1, D-66123 Saarbrucken, Germany
Keywords
Defect prediction; Untangling; Data noise; Software changes
DOI
10.1007/s10664-015-9376-6
Chinese Library Classification (CLC) number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
When interacting with source control management systems, developers often commit unrelated or loosely related code changes in a single transaction. When analyzing version histories, such tangled changes make all changes to all modules appear related, possibly compromising the resulting analyses through noise and bias. In an investigation of five open-source Java projects, we found that between 7% and 20% of all bug fixes consist of multiple tangled changes. Using a multi-predictor approach to untangle changes, we show that on average at least 16.6% of all source files are incorrectly associated with bug reports. These incorrect bug-file associations do not seem to significantly impact models that classify source files as having at least one bug or no bugs. However, our experiments show that untangling tangled code changes can result in more accurate regression bug prediction models than models trained and tested on tangled bug datasets; in our experiments, the statistically significant accuracy improvements lie between 5% and 200%. We recommend better change organization to limit the impact of tangled changes.
Pages: 303-336
Number of pages: 34