When less is more: on the value of “co-training” for semi-supervised software defect predictors

Cited by: 0
Authors
Suvodeep Majumder
Joymallya Chakraborty
Tim Menzies
Affiliations
[1] Department of Computer Science, North Carolina State University
Source
Empirical Software Engineering | 2024, Vol. 29
Keywords
Semi-supervised learning; SSL; Self-training; Co-training; Boosting methods; Semi-supervised preprocessing; Clustering-based semi-supervised preprocessing; Intrinsically semi-supervised methods; Graph-based methods; Co-forest; Effort-aware tri-training
DOI
Not available
Abstract
Labeling a module defective or non-defective is an expensive task. Hence, there are often limits on how much labeled data is available for training. Semi-supervised classifiers use far fewer labels for training models. However, there are numerous semi-supervised methods, including self-labeling, co-training, maximal-margin, and graph-based methods, to name a few. Only a handful of these methods have been tested in SE for (e.g.) predicting defects, and even there, those methods have been tested on just a handful of projects. This paper applies a wide range of 55 semi-supervised learners to over 714 projects. We find that semi-supervised “co-training methods” work significantly better than other approaches. Specifically, after labeling just 2.5% of the data, these methods make predictions that are competitive with those using 100% of the data. That said, co-training needs to be used cautiously, since the specific co-training method must be selected carefully according to a user’s goals. Also, we warn that a commonly used co-training method (“multi-view”, where different learners get different sets of columns) does not improve predictions, while adding greatly to the run-time cost (11 hours vs. 1.8 hours). It is an open question, worthy of future work, whether these reductions can be seen in other areas of software analytics. To assist with exploring other areas, all the code used is available at https://github.com/ai-se/Semi-Supervised.
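To make the co-training idea concrete, below is a minimal, self-contained sketch (not the paper’s co-forest or tri-training implementations, and all function names here are hypothetical): two learners are trained on different feature views of the same modules, and each one pseudo-labels unlabeled rows for the other, so that a small labeled seed can grow into a larger training set. For simplicity the learners are hand-rolled nearest-centroid classifiers rather than the ensembles studied in the paper.

```python
import statistics

def fit_centroids(X, y):
    """Nearest-centroid learner: mean feature vector per class."""
    cents = {}
    for cls in set(y):
        rows = [x for x, lbl in zip(X, y) if lbl == cls]
        cents[cls] = [statistics.fmean(col) for col in zip(*rows)]
    return cents

def predict(cents, x):
    """Assign x to the class with the nearest centroid (squared distance)."""
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(cents, key=lambda c: dist(cents[c], x))

def co_train(view_a, view_b, y_seed, rounds=3):
    """Multi-view co-training: the first len(y_seed) rows are labeled;
    each learner pseudo-labels one unlabeled row per round for the other."""
    n = len(y_seed)
    idx_a, ya = list(range(n)), list(y_seed)   # learner A's labeled pool
    idx_b, yb = list(range(n)), list(y_seed)   # learner B's labeled pool
    unlabeled = list(range(n, len(view_a)))
    for _ in range(rounds):
        if not unlabeled:
            break
        ma = fit_centroids([view_a[i] for i in idx_a], ya)
        mb = fit_centroids([view_b[i] for i in idx_b], yb)
        i = unlabeled.pop(0)
        # A's prediction feeds B's pool, and vice versa
        idx_b.append(i); yb.append(predict(ma, view_a[i]))
        idx_a.append(i); ya.append(predict(mb, view_b[i]))
    ma = fit_centroids([view_a[i] for i in idx_a], ya)
    mb = fit_centroids([view_b[i] for i in idx_b], yb)
    return ma, mb

# Toy defect data: two views (e.g. size metrics vs. churn metrics);
# only the first four rows carry labels.
view_a = [[10], [12], [1], [2], [11], [1.5]]
view_b = [[5], [6], [0], [0.5], [5.5], [0.2]]
ma, mb = co_train(view_a, view_b, ["buggy", "buggy", "clean", "clean"])
```

The sketch also illustrates the cost concern raised above: maintaining separate views and retraining two learners every round is extra work, which is why the paper measures whether multi-view splitting actually pays off.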