Estimating Uncertainty in Labeled Changes by SZZ Tools on Just-In-Time Defect Prediction

Times cited: 1
Authors
Guo, Shikai [1]
Li, Dongmin [1]
Huang, Lin [1]
Lv, Sijia [1]
Chen, Rong [1]
Li, Hui [1]
Li, Xiaochen [2]
Jiang, He [2]
Affiliations
[1] Dalian Maritime Univ, Dalian, Peoples R China
[2] Dalian Univ Technol, Dalian, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Just-in-time defect prediction; SZZ tools; confident learning; imbalance
DOI
10.1145/3637226
Chinese Library Classification number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
Just-In-Time (JIT) defect prediction aims to identify defect-prone software changes in a project in a timely manner, thereby improving the efficiency of software development and helping to ensure software quality. Identifying the changes that introduce bugs is a critical task in JIT defect prediction, and researchers have introduced the SZZ approach and its variants to label these changes. However, it has been shown that the different SZZ algorithms introduce some noise into the dataset, which may reduce the predictive performance of the model. To address this limitation, we propose the Confident Learning Imbalance (CLI) model. The model estimates the joint distribution of noisy labels and true labels, uses it to identify and exclude samples whose labels may be corrupted, and thereby mitigates the impact of noisy data on the performance of the prediction model. CLI consists of two components: the Confident Learning (CL) component, which identifies noisy data, and the Imbalanced Data Probabilistic Prediction (IDPP) component, which generates a predicted probability matrix for imbalanced data. The IDPP component generates precise predicted probabilities for each instance in the training set, while the CL component uses this predicted probability matrix together with the noisy labels to clean up the noise and build a classification model. We evaluate our model through extensive experiments on a total of 126,526 changes from ten Apache open source projects, and the results show that it outperforms the baseline methods.
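The noise-identification step described in the abstract can be illustrated with a minimal sketch of the general confident learning recipe. This is not the authors' CLI implementation: the function name `find_label_issues`, the per-class mean threshold, and the toy data are assumptions made for illustration. In CLI the probability matrix would come from the IDPP component; here it is written by hand.

```python
import numpy as np

def find_label_issues(labels, pred_probs):
    """Flag samples whose given label disagrees with a confident prediction.

    Hypothetical helper for illustration. Assumes every class has at least
    one labeled sample, and that pred_probs has shape (n_samples, n_classes).
    """
    labels = np.asarray(labels)
    pred_probs = np.asarray(pred_probs, dtype=float)
    n_classes = pred_probs.shape[1]

    # Per-class threshold: mean predicted probability of class k
    # over the samples whose given (possibly noisy) label is k.
    thresholds = np.array(
        [pred_probs[labels == k, k].mean() for k in range(n_classes)]
    )

    # A sample counts as confidently in class k if its probability for k
    # reaches that class's threshold; keep the most probable such class.
    above = pred_probs >= thresholds
    masked = np.where(above, pred_probs, -np.inf)
    confident_class = np.where(above.any(axis=1), masked.argmax(axis=1), -1)

    # Confident joint: counts of (given label, confident class) pairs,
    # an unnormalized estimate of the joint of noisy and true labels.
    confident_joint = np.zeros((n_classes, n_classes), dtype=int)
    for given, conf in zip(labels, confident_class):
        if conf >= 0:
            confident_joint[given, conf] += 1

    # Off-diagonal samples are the candidates for removal before training.
    suspect = (confident_class >= 0) & (confident_class != labels)
    return suspect, confident_joint

# Toy data: label 0 = clean change, label 1 = bug-introducing change.
labels = [0, 0, 0, 0, 1, 1, 1, 0]
pred_probs = [
    [0.90, 0.10], [0.80, 0.20], [0.85, 0.15], [0.70, 0.30],
    [0.15, 0.85], [0.10, 0.90], [0.35, 0.65],
    [0.10, 0.90],  # labeled clean, but predicted bug-introducing
]
suspect, confident_joint = find_label_issues(labels, pred_probs)
print(suspect.tolist())
print(confident_joint.tolist())
```

In this toy run only the last change is flagged, since it is labeled clean while being confidently predicted as bug-introducing. A cleaned training set would then be obtained by dropping the flagged rows before fitting the final classifier.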
Pages: 25