An Investigation of Cross-Project Learning in Online Just-In-Time Software Defect Prediction

Cited by: 52
Authors
Tabassum, Sadia [1 ]
Minku, Leandro L. [1 ]
Feng, Danyi [2 ]
Cabral, George G. [3 ]
Song, Liyan [1 ]
Affiliations
[1] University of Birmingham, Birmingham, West Midlands, England
[2] Xiliu Tech, Beijing, China
[3] Universidade Federal Rural de Pernambuco, Recife, PE, Brazil
Source
2020 ACM/IEEE 42nd International Conference on Software Engineering (ICSE 2020), 2020
Funding
Engineering and Physical Sciences Research Council (EPSRC), UK
Keywords
Software defect prediction; cross-project learning; transfer learning; online learning; verification latency; concept drift; class imbalance
DOI
10.1145/3377811.3380403
Chinese Library Classification: TP31 [Computer Software]
Subject Classification Codes: 081202; 0835
Abstract
Just-In-Time Software Defect Prediction (JIT-SDP) is concerned with predicting whether software changes are defect-inducing or clean based on machine learning classifiers. Building such classifiers requires a sufficient amount of training data, which is not available at the beginning of a software project. Cross-Project (CP) JIT-SDP can overcome this issue by using data from other projects to build the classifier, achieving predictive performance similar (though not superior) to that of classifiers trained on Within-Project (WP) data. However, such approaches have never been investigated in realistic online learning scenarios, where WP software changes arrive continuously over time and can be used to update the classifiers. It is unknown to what extent CP data can be helpful in such a situation. In particular, it is unknown whether CP data are useful only during the very initial phase of the project, when there is little WP data, or whether they could be helpful for extended periods of time. This work thus provides the first investigation of when and to what extent CP data are useful for JIT-SDP in a realistic online learning scenario. For that, we develop three different CP JIT-SDP approaches that can operate in online mode and be updated with both incoming CP and WP training examples over time. We also collect 2,048 commits from three software repositories being developed by a software company over the course of 9 to 10 months, and use 198,468 commits from 10 active open source GitHub projects being developed over the course of 6 to 14 years. The study shows that training classifiers with incoming CP+WP data can lead to improvements in G-mean of up to 53.90% compared to classifiers using only WP data at the initial stage of the projects. For the open source projects, which have been running for longer periods of time, using CP data to supplement WP data also helped the classifiers to reduce or prevent large drops in predictive performance that may occur over time, leading to up to around 40% better G-mean during such periods. Such use of CP data was shown to be beneficial even after a large number of WP data were received, leading to overall G-means up to 18.5% better than those of WP classifiers.
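To make the setting concrete, below is a minimal Python sketch of the online scenario the abstract describes: a single incremental classifier is first fed cross-project (CP) commits and then updated with within-project (WP) commits as they arrive, evaluated prequentially (predict first, then train) and summarized with G-mean. This is only an illustration under simplifying assumptions, not the paper's actual approach, which additionally handles verification latency and class imbalance. The commit_stream generator and its five "change metrics" are hypothetical placeholders, and scikit-learn's SGDClassifier stands in for whichever online learner is used.

import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(42)

def commit_stream(n_commits, defect_rate=0.1):
    # Hypothetical stand-in for a labelled commit stream; label 1 = defect-inducing.
    for _ in range(n_commits):
        y = int(rng.random() < defect_rate)
        x = rng.normal(loc=y, scale=1.0, size=5)   # five toy change metrics
        yield x.reshape(1, -1), y

clf = SGDClassifier(random_state=0)   # any incremental classifier would do
classes = np.array([0, 1])
tp = tn = fp = fn = 0

# CP data is available from day one; WP data only arrives as the target
# project evolves. For simplicity the two streams are processed in sequence.
for stream in (commit_stream(1000), commit_stream(300)):
    for x, y in stream:
        if hasattr(clf, "coef_"):     # prequential evaluation: test, then train
            pred = int(clf.predict(x)[0])
            if y == 1:
                tp += pred == 1
                fn += pred == 0
            else:
                tn += pred == 0
                fp += pred == 1
        clf.partial_fit(x, [y], classes=classes)

# G-mean: geometric mean of the recalls on the defect-inducing and clean
# classes, the metric in which the abstract reports its improvements.
recall_defect = tp / max(tp + fn, 1)
recall_clean = tn / max(tn + fp, 1)
print(f"G-mean: {(recall_defect * recall_clean) ** 0.5:.3f}")

Tracking both per-class recalls is what makes G-mean suitable here: defect-inducing commits are rare, so a metric like accuracy would reward always predicting "clean", whereas G-mean collapses to zero if either class is ignored.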
Pages: 554-565 (12 pages)
Cited References (10 of 35 shown)
[1]   Is "Better Data" Better Than "Better Data Miners"? On the Benefits of Tuning SMOTE for Defect Prediction [J].
Agrawal, Amritanshu ;
Menzies, Tim .
PROCEEDINGS 2018 IEEE/ACM 40TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING (ICSE), 2018, :1050-1061
[2]   Class Imbalance Evolution and Verification Latency in Just-in-Time Software Defect Prediction [J].
Cabral, George G. ;
Minku, Leandro L. ;
Shihab, Emad ;
Mujahid, Suhaib .
2019 IEEE/ACM 41ST INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING (ICSE 2019), 2019, :666-676
[3]   Multi-Objective Cross-Project Defect Prediction [J].
Canfora, Gerardo ;
De Lucia, Andrea ;
Di Penta, Massimiliano ;
Oliveto, Rocco ;
Panichella, Annibale ;
Panichella, Sebastiano .
2013 IEEE SIXTH INTERNATIONAL CONFERENCE ON SOFTWARE TESTING, VERIFICATION AND VALIDATION (ICST 2013), 2013, :252-261
[4]  
Catolino G, 2019, 2019 IEEE/ACM 6TH INTERNATIONAL CONFERENCE ON MOBILE SOFTWARE ENGINEERING AND SYSTEMS (MOBILESOFT 2019), P99, DOI 10.1109/MOBILESoft.2019.00023
[5]   MULTI: Multi-objective effort-aware just-in-time software defect prediction [J].
Chen, Xiang ;
Zhao, Yingquan ;
Wang, Qiuping ;
Yuan, Zhidan .
INFORMATION AND SOFTWARE TECHNOLOGY, 2018, 93 :1-13
[6] Demsar, Janez. Statistical Comparisons of Classifiers over Multiple Data Sets. Journal of Machine Learning Research, 2006, 7, pp. 1-30.
[7] Ditzler, Gregory; Roveri, Manuel; Alippi, Cesare; Polikar, Robi. Learning in Nonstationary Environments: A Survey. IEEE Computational Intelligence Magazine, 2015, 10(4), pp. 12-25.
[8] Domingos, Pedro; Hulten, Geoff. Mining High-Speed Data Streams. Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2000), 2000, pp. 71-80. DOI: 10.1145/347090.347107.
[9] Gama, Joao; Sebastiao, Raquel; Rodrigues, Pedro Pereira. On evaluating stream learning algorithms. Machine Learning, 2013, 90(3), pp. 317-346.
[10] Gyimóthy, T.; Ferenc, R.; Siket, I. Empirical validation of object-oriented metrics on open source software for fault prediction. IEEE Transactions on Software Engineering, 2005, 31(10), pp. 897-910.