The Impact of Data Merging on the Interpretation of Cross-Project Just-In-Time Defect Models

Cited by: 8

Authors:

Lin, Dayi [1]

Tantithamthavorn, Chakkrit [2]

Hassan, Ahmed E. [3]

Affiliations:

[1] Ctr Software Excellence, Huawei, ON L3R 5A4, Canada

[2] Monash Univ, Fac Informat Technol, Clayton, Vic 3800, Australia

[3] Queens Univ, Sch Comp, Kingston, ON K7L 3N6, Canada

Source:

IEEE TRANSACTIONS ON SOFTWARE ENGINEERING | 2022 / Vol. 48 / No. 08

Funding:

Australian Research Council;

Keywords:

Context modeling; Data models; Predictive models; Measurement; Training; Merging; Planning; Just-in-time defect prediction; data merging; mixed-effect model; cross-project defect prediction; PREDICTION;

DOI:

10.1109/TSE.2021.3073920

Chinese Library Classification (CLC) number:

TP31 [Computer Software];

Discipline classification code:

081202 ; 0835 ;

Abstract:

Just-In-Time (JIT) defect models are classification models that identify the code commits that are likely to introduce defects. Cross-project JIT models have been introduced to address the suboptimal performance of JIT models when historical data is limited. However, many studies built cross-project JIT models using a pool of mixed data from multiple projects (i.e., data merging), assuming that the properties of defect-introducing commits of a project are similar to those of the other projects, which is likely not true. In this paper, we set out to investigate the interpretation of JIT defect models that are built from individual project data and a pool of mixed project data with and without consideration of project-level variances. Through a case study of 20 datasets of open source projects, we found that (1) the interpretation of JIT models that are built from individual projects varies among projects; and (2) the project-level variances cannot be captured by a JIT model that is trained from a pool of mixed data from multiple projects without considering project-level variances (i.e., a global JIT model). On the other hand, a mixed-effect JIT model that considers project-level variances represents the different interpretations better, without sacrificing performance, especially when the contexts of projects are considered. The results hold for different mixed-effect learning algorithms. When the goal is to derive sound interpretation of cross-project JIT models, we suggest that practitioners and researchers should opt to use a mixed-effect modelling approach that considers individual projects and contexts.

Citation:

Pages: 2969-2986

Page count: 18
