When less is more: on the value of “co-training” for semi-supervised software defect predictors

Cited by: 0
Authors
Suvodeep Majumder
Joymallya Chakraborty
Tim Menzies
Affiliations
[1] North Carolina State University, Department of Computer Science
Source
Empirical Software Engineering | 2024 / Vol. 29
Keywords
Semi-supervised learning; SSL; Self-training; Co-training; Boosting methods; Semi-supervised preprocessing; Clustering-based semi-supervised preprocessing; Intrinsically semi-supervised methods; Graph-based methods; Co-forest; Effort-aware tri-training
DOI
Not available
Abstract
Labeling a module as defective or non-defective is an expensive task; hence, there are often limits on how much labeled data is available for training. Semi-supervised classifiers need far fewer labels to train models. However, there are numerous semi-supervised methods, including self-labeling, co-training, maximal-margin, and graph-based methods, to name a few. Only a handful of these methods have been tested in SE for (e.g.) predicting defects, and even there, those methods have been tested on just a handful of projects. This paper applies a wide range of 55 semi-supervised learners to over 714 projects. We find that semi-supervised “co-training methods” work significantly better than other approaches. Specifically, after labeling just 2.5% of the data, they make predictions that are competitive with those that use 100% of the data. That said, co-training should be applied cautiously, since the choice of co-training method must be matched to a user’s specific goals. Also, we warn that a commonly used co-training method (“multi-view”, where different learners get different sets of columns) does not improve predictions while adding substantially to run-time costs (11 hours vs. 1.8 hours). It is an open question, worthy of future work, whether these reductions can be seen in other areas of software analytics. To assist with exploring other areas, all the code used here is available at https://github.com/ai-se/Semi-Supervised.
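
To make the co-training idea concrete, below is a minimal, self-contained sketch of a classic Blum-and-Mitchell-style multi-view co-training loop, written in Python with scikit-learn. This is illustrative only, not the paper's implementation (see the repository linked above for that): the synthetic data, the even split of columns into two “views”, the five-rows-per-view confidence cutoff, and the ten-round budget are all assumptions chosen for brevity. The only number taken from the abstract is the 2.5% initial labeling budget.

# Illustrative multi-view co-training sketch (NOT the paper's code; see the
# repository above for that). Two learners see disjoint column "views"; each
# round, each learner pseudo-labels its most confident unlabeled rows, and
# those rows join the shared labeled pool used to retrain both learners.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Start with only 2.5% of rows labeled: the budget quoted in the abstract.
labeled = list(rng.choice(len(X), int(0.025 * len(X)), replace=False))
unlabeled = np.setdiff1d(np.arange(len(X)), labeled)
pseudo_y = {i: y[i] for i in labeled}            # row index -> (pseudo-)label

views = [np.arange(0, 10), np.arange(10, 20)]    # assumed 50/50 column split
learners = [LogisticRegression(max_iter=1000) for _ in views]

for _ in range(10):                              # assumed 10 co-training rounds
    if len(unlabeled) == 0:
        break
    idx = np.array(sorted(pseudo_y))
    lbl = np.array([pseudo_y[i] for i in idx])
    for view, clf in zip(views, learners):       # retrain each view's learner
        clf.fit(X[idx][:, view], lbl)
    picked = []
    for view, clf in zip(views, learners):
        proba = clf.predict_proba(X[unlabeled][:, view])
        top = np.argsort(proba.max(axis=1))[-5:] # 5 most confident per view
        picked += [(unlabeled[i], proba[i].argmax()) for i in top]
    for i, guess in picked:                      # share pseudo-labels across views
        pseudo_y.setdefault(int(i), int(guess))
    unlabeled = np.setdiff1d(unlabeled, [i for i, _ in picked])

# Predict by averaging the two views' class probabilities.
proba = np.mean([c.predict_proba(X[:, v]) for c, v in zip(learners, views)],
                axis=0)
print("agreement with true labels: %.3f" % (proba.argmax(axis=1) == y).mean())

Note that the column split here is exactly the “multi-view” design the abstract cautions about: whether two column views improve predictions, or merely add run time, is the empirical question the paper studies.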
Related papers
50 results (first 10 shown)
  • [1] When less is more: on the value of "co-training" for semi-supervised software defect predictors
    Majumder, Suvodeep
    Chakraborty, Joymallya
    Menzies, Tim
    EMPIRICAL SOFTWARE ENGINEERING, 2024, 29 (02)
  • [2] Semi-Supervised Regression with Co-Training
    Zhou, Zhi-Hua
    Li, Ming
    19TH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE (IJCAI-05), 2005, : 908 - 913
  • [3] Safe co-training for semi-supervised regression
    Liu, Liyan
    Huang, Peng
    Yu, Hong
    Min, Fan
    INTELLIGENT DATA ANALYSIS, 2023, 27 (04) : 959 - 975
  • [4] Deep co-training for semi-supervised image segmentation
    Peng, Jizong
    Estrada, Guillermo
    Pedersoli, Marco
    Desrosiers, Christian
    PATTERN RECOGNITION, 2020, 107 (107)
  • [5] Semi-Supervised Classification with Co-training for Deep Web
    Fang Wei
    Cui Zhiming
    ADVANCED MEASUREMENT AND TEST, PARTS 1 AND 2, 2010, 439-440 : 183 - +
  • [6] Spatial co-training for semi-supervised image classification
    Hong, Yi
    Zhu, Weiping
    PATTERN RECOGNITION LETTERS, 2015, 63 : 59 - 65
  • [7] Semi-supervised Learning for Regression with Co-training by Committee
    Hady, Mohamed Farouk Abdel
    Schwenker, Friedhelm
    Palm, Guenther
    ARTIFICIAL NEURAL NETWORKS - ICANN 2009, PT I, 2009, 5768 : 121 - 130
  • [8] Deep Co-Training for Semi-Supervised Image Recognition
    Qiao, Siyuan
    Shen, Wei
    Zhang, Zhishuai
    Wang, Bo
    Yuille, Alan
    COMPUTER VISION - ECCV 2018, PT 15, 2018, 11219 : 142 - 159
  • [9] RSSalg software: A tool for flexible experimenting with co-training based semi-supervised algorithms
    Slivka, J.
    Sladic, G.
    Milosavljevic, B.
    Kovacevic, A.
    KNOWLEDGE-BASED SYSTEMS, 2017, 121 : 4 - 6
  • [10] Co-Training based Semi-Supervised Web Spam Detection
    Wang, Wei
    Lee, Xiao-Dong
    Hu, An-Lei
    Geng, Guang-Gang
    2013 10TH INTERNATIONAL CONFERENCE ON FUZZY SYSTEMS AND KNOWLEDGE DISCOVERY (FSKD), 2013, : 789 - 793