When less is more: on the value of “co-training” for semi-supervised software defect predictors

Cited: 0
Authors
Suvodeep Majumder
Joymallya Chakraborty
Tim Menzies
Affiliations
[1] North Carolina State University, Department of Computer Science
Source
Empirical Software Engineering | 2024 / Volume 29
Keywords
Semi-supervised learning; SSL; Self-training; Co-training; Boosting methods; Semi-supervised preprocessing; Clustering-based semi-supervised preprocessing; Intrinsically semi-supervised methods; Graph-based methods; Co-forest; Effort aware tri-training;
DOI: not available
Abstract
Labeling a module as defective or non-defective is an expensive task. Hence, there are often limits on how much labeled data is available for training. Semi-supervised classifiers use far fewer labels for training models. However, there are numerous semi-supervised methods, including self-labeling, co-training, maximal-margin, and graph-based methods, to name a few. Only a handful of these methods have been tested in SE for (e.g.) predicting defects, and even there, those methods have been tested on just a handful of projects. This paper applies a wide range of 55 semi-supervised learners to over 714 projects. We find that semi-supervised “co-training methods” work significantly better than other approaches. Specifically, after labeling just 2.5% of the data, they make predictions that are competitive with those using 100% of the data. That said, co-training needs to be used cautiously, since the specific co-training method must be chosen carefully according to a user’s goals. Also, we warn that a commonly used co-training method (“multi-view”, where different learners get different sets of columns) does not improve predictions while adding considerably to run time (11 hours vs. 1.8 hours). It is an open question, worthy of future work, whether these reductions can be seen in other areas of software analytics. To assist with exploring other areas, all the code used is available at https://github.com/ai-se/Semi-Supervised.
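For readers unfamiliar with the technique, the following is a minimal co-training sketch in Python. It is not the paper's implementation (that lives at the GitHub link above); it assumes scikit-learn is available and uses two hypothetical learners, a random forest and a logistic regression, that repeatedly pseudo-label their most confident unlabeled modules for each other, starting from a small labeled pool (e.g. 2.5% of the data).

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression

    def co_train(X_lab, y_lab, X_unlab, rounds=10, add_per_round=20):
        """Grow a small labeled pool by letting two diverse learners
        pseudo-label confident examples for each other."""
        clf_a = RandomForestClassifier(random_state=0)
        clf_b = LogisticRegression(max_iter=1000)
        X_lab, y_lab = np.asarray(X_lab), np.asarray(y_lab)
        X_unlab = np.asarray(X_unlab)
        for _ in range(rounds):
            for teacher in (clf_a, clf_b):
                if len(X_unlab) == 0:
                    break
                teacher.fit(X_lab, y_lab)
                probs = teacher.predict_proba(X_unlab)
                # Pick the unlabeled rows this learner is most confident about ...
                top = np.argsort(-probs.max(axis=1))[:add_per_round]
                pseudo = teacher.classes_[probs[top].argmax(axis=1)]
                # ... and move them, with their pseudo-labels, into the shared
                # labeled pool that the other learner trains on next.
                X_lab = np.vstack([X_lab, X_unlab[top]])
                y_lab = np.concatenate([y_lab, pseudo])
                X_unlab = np.delete(X_unlab, top, axis=0)
        return clf_a, clf_b

A “multi-view” variant would additionally give each learner a different subset of columns; the abstract above reports that this extra step did not improve predictions while greatly increasing run time.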
Related Papers
50 items in total
  • [41] A Semi-Supervised Approach to Software Defect Prediction
    Lu, Huihua
    Cukic, Bojan
    Culp, Mark
    2014 IEEE 38TH ANNUAL INTERNATIONAL COMPUTERS, SOFTWARE AND APPLICATIONS CONFERENCE (COMPSAC), 2014, : 416 - 425
  • [42] RFID indoor positioning based on semi-supervised actor-critic co-training
    Li L.
    Jiali Z.
    Yixuan Q.
    Zihan L.
    Yingchao L.
    Tianxing H.
    Journal of China Universities of Posts and Telecommunications, 2020, 27 (05): : 69 - 81
  • [43] Co-training Semi-supervised Learning for Single-Target Regression in Data Streams Using AMRules
    Sousa, Ricardo
    Gama, Joao
    FOUNDATIONS OF INTELLIGENT SYSTEMS, ISMIS 2017, 2017, 10352 : 499 - 508
  • [44] Semi-Supervised Co-Training Model Using Convolution and Transformer for Hyperspectral Image Classification
    Zhao, Feng
    Song, Xiqun
    Zhang, Junjie
    Liu, Hanqiang
    IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, 2024, 21
  • [45] HIGH ACCURATE INTERNET TRAFFIC CLASSIFICATION BASED ON CO-TRAINING SEMI-SUPERVISED CLUSTERING
    Li, Xiang
    Qi, Feng
    Yu, Li Kun
    Qiu, Xue Song
    PROCEEDINGS OF THE 2010 INTERNATIONAL CONFERENCE ON ADVANCED INTELLIGENCE AND AWARENESS INTERNET, AIAI2010, 2010, : 193 - 197
  • [46] Research on semi-supervised heterogeneous adaptive co-training soft-sensor model
    Li D.
    Huang D.
    Liu Y.
    Huagong Xuebao/CIESC Journal, 2020, 71 (05): : 2128 - 2138
  • [47] Co-training partial least squares model for semi-supervised soft sensor development
    Bao, Liang
    Yuan, Xiaofeng
    Ge, Zhiqiang
    CHEMOMETRICS AND INTELLIGENT LABORATORY SYSTEMS, 2015, 147 : 75 - 85
  • [48] Semi-Supervised Learning Approach for Indonesian Named Entity Recognition (NER) Using Co-Training Algorithm
    Aryoyudanta, Bayu
    Adji, Teguh Bharata
    Hidayah, Indriana
    2016 INTERNATIONAL SEMINAR ON INTELLIGENT TECHNOLOGY AND ITS APPLICATIONS (ISITIA): RECENT TRENDS IN INTELLIGENT COMPUTATIONAL TECHNOLOGIES FOR SUSTAINABLE ENERGY, 2016, : 7 - 11
  • [49] Multi-head co-training: An uncertainty-aware and robust semi-supervised learning framework
    Chen, Mingcai
    Wang, Chongjun
    KNOWLEDGE-BASED SYSTEMS, 2024, 302
  • [50] Semi-supervised Software Defect Prediction Model Based on Tri-training
    Meng, Fanqi
    Cheng, Wenying
    Wang, Jingdong
    KSII TRANSACTIONS ON INTERNET AND INFORMATION SYSTEMS, 2021, 15 (11): : 4028 - 4042