When less is more: on the value of “co-training” for semi-supervised software defect predictors

Cited by: 0
Authors
Suvodeep Majumder
Joymallya Chakraborty
Tim Menzies
Affiliations
[1] North Carolina State University, Department of Computer Science
Source
Empirical Software Engineering | 2024 / Vol. 29
Keywords
Semi-supervised learning; SSL; Self-training; Co-training; Boosting methods; Semi-supervised preprocessing; Clustering-based semi-supervised preprocessing; Intrinsically semi-supervised methods; Graph-based methods; Co-forest; Effort-aware tri-training
DOI
Not available
Abstract
Labeling a module as defective or non-defective is an expensive task; hence, there are often limits on how much labeled data is available for training. Semi-supervised classifiers need far fewer labels to train models. However, there are numerous semi-supervised methods, including self-labeling, co-training, maximal-margin, and graph-based methods, to name a few. Only a handful of these methods have been tested in SE for (e.g.) predicting defects, and even there, those methods have been tested on just a handful of projects. This paper applies a wide range of 55 semi-supervised learners to over 714 projects. We find that semi-supervised “co-training methods” work significantly better than other approaches. Specifically, after labeling just 2.5% of the data, they make predictions that are competitive with those that use 100% of the data. That said, co-training should be applied cautiously, since the choice of co-training method must be matched to a user’s specific goals. Also, we warn that a commonly used co-training method (“multi-view”, where different learners get different sets of columns) does not improve predictions while adding substantially to run-time costs (11 hours vs. 1.8 hours). It is an open question, worthy of future work, whether these reductions can be seen in other areas of software analytics. To assist with exploring other areas, all the code used here is available at https://github.com/ai-se/Semi-Supervised.
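
To make the co-training idea concrete, below is a minimal, self-contained sketch of a classic Blum-and-Mitchell-style multi-view co-training loop, written in Python with scikit-learn. This is illustrative only, not the paper's implementation (see the repository linked above for that): the synthetic data, the even split of columns into two “views”, the five-rows-per-view confidence cutoff, and the ten-round budget are all assumptions chosen for brevity. The only number taken from the abstract is the 2.5% initial labeling budget.

# Illustrative multi-view co-training sketch (NOT the paper's code; see the
# repository above for that). Two learners see disjoint column "views"; each
# round, each learner pseudo-labels its most confident unlabeled rows, and
# those rows join the shared labeled pool used to retrain both learners.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Start with only 2.5% of rows labeled: the budget quoted in the abstract.
labeled = list(rng.choice(len(X), int(0.025 * len(X)), replace=False))
unlabeled = np.setdiff1d(np.arange(len(X)), labeled)
pseudo_y = {i: y[i] for i in labeled}            # row index -> (pseudo-)label

views = [np.arange(0, 10), np.arange(10, 20)]    # assumed 50/50 column split
learners = [LogisticRegression(max_iter=1000) for _ in views]

for _ in range(10):                              # assumed 10 co-training rounds
    if len(unlabeled) == 0:
        break
    idx = np.array(sorted(pseudo_y))
    lbl = np.array([pseudo_y[i] for i in idx])
    for view, clf in zip(views, learners):       # retrain each view's learner
        clf.fit(X[idx][:, view], lbl)
    picked = []
    for view, clf in zip(views, learners):
        proba = clf.predict_proba(X[unlabeled][:, view])
        top = np.argsort(proba.max(axis=1))[-5:] # 5 most confident per view
        picked += [(unlabeled[i], proba[i].argmax()) for i in top]
    for i, guess in picked:                      # share pseudo-labels across views
        pseudo_y.setdefault(int(i), int(guess))
    unlabeled = np.setdiff1d(unlabeled, [i for i, _ in picked])

# Predict by averaging the two views' class probabilities.
proba = np.mean([c.predict_proba(X[:, v]) for c, v in zip(learners, views)],
                axis=0)
print("agreement with true labels: %.3f" % (proba.argmax(axis=1) == y).mean())

Note that the column split here is exactly the “multi-view” design the abstract cautions about: whether two column views improve predictions, or merely add run time, is the empirical question the paper studies.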
Related papers
50 results (first 10 shown)
  • [1] When less is more: on the value of "co-training" for semi-supervised software defect predictors
    Majumder, Suvodeep
    Chakraborty, Joymallya
    Menzies, Tim
    EMPIRICAL SOFTWARE ENGINEERING, 2024, 29 (02)
  • [2] Semi-Supervised Regression with Co-Training
    Zhou, Zhi-Hua
    Li, Ming
    19TH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE (IJCAI-05), 2005, : 908 - 913
  • [3] Safe co-training for semi-supervised regression
    Liu, Liyan
    Huang, Peng
    Yu, Hong
    Min, Fan
    INTELLIGENT DATA ANALYSIS, 2023, 27 (04) : 959 - 975
  • [4] Deep co-training for semi-supervised image segmentation
    Peng, Jizong
    Estrada, Guillermo
    Pedersoli, Marco
    Desrosiers, Christian
    PATTERN RECOGNITION, 2020, 107 (107)
  • [5] Semi-Supervised Classification with Co-training for Deep Web
    Fang Wei
    Cui Zhiming
    ADVANCED MEASUREMENT AND TEST, PARTS 1 AND 2, 2010, 439-440 : 183 - +
  • [6] Spatial co-training for semi-supervised image classification
    Hong, Yi
    Zhu, Weiping
    PATTERN RECOGNITION LETTERS, 2015, 63 : 59 - 65
  • [7] Semi-supervised Learning for Regression with Co-training by Committee
    Hady, Mohamed Farouk Abdel
    Schwenker, Friedhelm
    Palm, Guenther
    ARTIFICIAL NEURAL NETWORKS - ICANN 2009, PT I, 2009, 5768 : 121 - 130
  • [8] Deep Co-Training for Semi-Supervised Image Recognition
    Qiao, Siyuan
    Shen, Wei
    Zhang, Zhishuai
    Wang, Bo
    Yuille, Alan
    COMPUTER VISION - ECCV 2018, PT 15, 2018, 11219 : 142 - 159
  • [9] RSSalg software: A tool for flexible experimenting with co-training based semi-supervised algorithms
    Slivka, J.
    Sladic, G.
    Milosavljevic, B.
    Kovacevic, A.
    KNOWLEDGE-BASED SYSTEMS, 2017, 121 : 4 - 6
  • [10] Co-Training based Semi-Supervised Web Spam Detection
    Wang, Wei
    Lee, Xiao-Dong
    Hu, An-Lei
    Geng, Guang-Gang
    2013 10TH INTERNATIONAL CONFERENCE ON FUZZY SYSTEMS AND KNOWLEDGE DISCOVERY (FSKD), 2013, : 789 - 793