When less is more: on the value of "co-training" for semi-supervised software defect predictors

Cited by: 3
Authors
Majumder, Suvodeep [1]
Chakraborty, Joymallya [1]
Menzies, Tim [1]
Affiliations
[1] North Carolina State Univ, Dept Comp Sci, Raleigh, NC 27606 USA
Funding
National Science Foundation (USA)
Keywords
Semi-supervised learning; SSL; Self-training; Co-training; Boosting methods; Semi-supervised preprocessing; Clustering-based semi-supervised preprocessing; Intrinsically semi-supervised methods; Graph-based methods; Co-forest; Effort aware tri-training; Support vector machine; Feature selection; Models; Search
DOI
10.1007/s10664-023-10418-4
Chinese Library Classification number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
Labeling a module as defective or non-defective is an expensive task. Hence, there are often limits on how much labeled data is available for training. Semi-supervised classifiers use far fewer labels for training models. However, there are numerous semi-supervised methods, including self-labeling, co-training, maximal-margin, and graph-based methods, to name a few. Only a handful of these methods have been tested in SE for (e.g.) predicting defects, and even there, those methods have been tested on just a handful of projects. This paper applies a wide range of 55 semi-supervised learners to over 714 projects. We find that semi-supervised "co-training methods" work significantly better than other approaches. Specifically, after labeling just 2.5% of the data, they make predictions that are competitive with those using 100% of the data. That said, co-training needs to be used cautiously, since the specific co-training method must be carefully selected based on a user's specific goals. Also, we warn that a commonly used co-training method ("multi-view", where different learners get different sets of columns) does not improve predictions (while adding too much to the run time cost: 11 hours vs. 1.8 hours). It is an open question, worthy of future work, to test if these reductions can be seen in other areas of software analytics. To assist with exploring other areas, all the code used is available at https://github.com/ai-se/Semi-Supervised.
Pages: 33