The Impact of Data Merging on the Interpretation of Cross-Project Just-In-Time Defect Models

Cited by: 8
Authors
Lin, Dayi [1]
Tantithamthavorn, Chakkrit [2]
Hassan, Ahmed E. [3]
Affiliations
[1] Ctr Software Excellence, Huawei, ON L3R 5A4, Canada
[2] Monash Univ, Fac Informat Technol, Clayton, Vic 3800, Australia
[3] Queens Univ, Sch Comp, Kingston, ON K7L 3N6, Canada
Funding
Australian Research Council
Keywords
Context modeling; Data models; Predictive models; Measurement; Training; Merging; Planning; Just-in-time defect prediction; data merging; mixed-effect model; cross-project defect prediction; PREDICTION;
DOI
10.1109/TSE.2021.3073920
Chinese Library Classification number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
Just-In-Time (JIT) defect models are classification models that identify the code commits that are likely to introduce defects. Cross-project JIT models have been introduced to address the suboptimal performance of JIT models when historical data is limited. However, many studies built cross-project JIT models using a pool of mixed data from multiple projects (i.e., data merging), assuming that the properties of defect-introducing commits of one project are similar to those of the other projects, which is likely not true. In this paper, we set out to investigate the interpretation of JIT defect models that are built from individual project data and a pool of mixed project data with and without consideration of project-level variances. Through a case study of 20 datasets of open source projects, we found that (1) the interpretation of JIT models that are built from individual projects varies among projects; and (2) the project-level variances cannot be captured by a JIT model that is trained from a pool of mixed data from multiple projects without considering project-level variances (i.e., a global JIT model). On the other hand, a mixed-effect JIT model that considers project-level variances represents the different interpretations better, without sacrificing performance, especially when the contexts of projects are considered. The results hold for different mixed-effect learning algorithms. When the goal is to derive a sound interpretation of cross-project JIT models, we suggest that practitioners and researchers should opt to use a mixed-effect modelling approach that considers individual projects and contexts.
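The contrast the abstract draws between per-project models and a global model trained on merged data can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the two projects, the single feature (a churn-like value), and the labels are hypothetical toy data, and an ordinary least-squares slope stands in for the logistic and mixed-effect models studied in the paper. It only shows how merging can hide opposite per-project relationships; it does not implement a mixed-effect model.

```python
# Hypothetical toy data, not taken from the paper:
# each row is (feature value, defect label) for one commit.
commits = {
    "project_a": [(1, 0), (2, 0), (3, 1), (4, 1)],  # defects rise with the feature
    "project_b": [(1, 1), (2, 1), (3, 0), (4, 0)],  # defects fall with the feature
}


def ols_slope(rows):
    """Least-squares slope of the label on the feature (a simplification)."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    var = sum((x - mean_x) ** 2 for x, _ in rows)
    return cov / var


# One slope per project: the "individual project" view.
local_slopes = {name: ols_slope(rows) for name, rows in commits.items()}

# One slope for the merged pool: the "global model" view.
pooled = [row for rows in commits.values() for row in rows]
global_slope = ols_slope(pooled)

print("per-project slopes:", local_slopes)
print("global slope:", global_slope)
```

In this toy setting the two projects have slopes of opposite sign, while the pooled slope is zero, so the merged fit suggests the feature carries no signal at all. A mixed-effect model, as the abstract suggests, would instead let the coefficient vary by project.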
Pages: 2969-2986
Page count: 18