Snoring: a Noise in Defect Prediction Datasets

Citations: 13
Authors
Ahluwalia, Aalok [1 ]
Falessi, Davide [1 ]
Di Penta, Massimiliano [2 ]
Affiliations
[1] Calif Polytech State Univ San Luis Obispo, San Luis Obispo, CA 93407 USA
[2] Univ Sannio, Benevento, Italy
Source
2019 IEEE/ACM 16TH INTERNATIONAL CONFERENCE ON MINING SOFTWARE REPOSITORIES (MSR 2019) | 2019
Keywords
Defect prediction; Fix-inducing changes; Dataset bias;
DOI
10.1109/MSR.2019.00019
CLC Classification Number
TP31 [Computer Software]
Subject Classification Number
081202; 0835
Abstract
To develop and train defect prediction models, researchers rely on datasets in which a defect is typically attributed to the release in which it was discovered. However, in many circumstances a defect is discovered only several releases after its introduction. This can bias the dataset: the intermediate releases are treated as defect-free and only the latter release as defect-prone. We call this phenomenon "sleeping defects". We call "snoring" the phenomenon whereby classes are affected only by sleeping defects and are therefore treated as defect-free until the defect is discovered. In this paper we analyze, on data from 282 releases of six open source projects from the Apache ecosystem, the magnitude of sleeping defects and of snoring classes. Our results indicate that 1) in all projects, most defects slept for more than 20% of the existing releases, and 2) in the majority of projects the missing rate is more than 25% even when the last 50% of releases are removed.
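The sleep duration the abstract measures can be illustrated with a minimal sketch. This is not the authors' tooling: the defect data, release indices, and the `sleep_fraction` helper below are all hypothetical, and it assumes each defect is described by the release where it was introduced and the release where it was discovered.

```python
# Hypothetical sketch: estimate how long defects "sleep", expressed as
# a fraction of a project's releases (cf. the abstract's 20% finding).

def sleep_fraction(introduced, discovered, total_releases):
    """Fraction of the project's releases a defect slept through
    between its introduction and its discovery."""
    return (discovered - introduced) / total_releases

# Hypothetical project with 10 releases and three defects,
# each given as (introduced_release, discovered_release).
defects = [(1, 2), (2, 7), (4, 9)]
total_releases = 10

fractions = [sleep_fraction(i, d, total_releases) for i, d in defects]

# Defects that slept for more than 20% of the releases would make the
# intermediate releases look defect-free in a naive dataset.
long_sleepers = [f for f in fractions if f > 0.20]
```

In this toy example, two of the three defects sleep through half of the releases, so any class touched only by those defects would be mislabeled as defect-free for several releases in a row.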
Pages: 63-67
Page count: 5