Semi-supervised Pre-processing for Learning-Based Traceability Framework on Real-World Software Projects

Citations: 11
Authors
Dong, Liming [1]
Zhang, He [1]
Liu, Wei [1]
Weng, Zhiluo [1]
Kuang, Hongyu [1]
Affiliations
[1] Nanjing Univ, Software Inst, State Key Lab Novel Software Technol, Nanjing, Jiangsu, Peoples R China
Source
PROCEEDINGS OF THE 30TH ACM JOINT MEETING EUROPEAN SOFTWARE ENGINEERING CONFERENCE AND SYMPOSIUM ON THE FOUNDATIONS OF SOFTWARE ENGINEERING, ESEC/FSE 2022 | 2022
Funding
National Natural Science Foundation of China;
Keywords
Software Traceability; Semi-supervised Learning; Learning-based Model; Industry Practice; Data Imbalance; Data Sparsity; LINKS; REQUIREMENTS; RECOVERY; CHALLENGES; IMPACT; CODE;
DOI
10.1145/3540250.3549151
Chinese Library Classification (CLC) number
TP31 [Computer Software];
Discipline classification codes
081202 ; 0835 ;
Abstract
The traceability of software artifacts is recognized as an important factor in supporting various activities in software development processes. However, traceability can be difficult and time-consuming to create and maintain manually, so automated approaches have gained much attention. Unfortunately, existing automated approaches for traceability suffer from practical issues. This paper aims to understand why state-of-the-art ML-based trace link classifiers underperform when applied to real-world projects. By investigating different industrial datasets, we found that two critical (and classic) challenges, i.e., data imbalance and data sparsity, arise in traceability automation for real-world projects. To overcome these challenges, we developed a framework called SPLINT that incorporates hybrid textual similarity measures and semi-supervised learning strategies as enhancements to learning-based traceability approaches. We carried out experiments with six open-source platforms and ten industry datasets. The results confirm that SPLINT achieves higher performance on datasets from both communities. Specifically, the industrial datasets, which suffer significantly from data imbalance and sparsity, show an average increase of over 14% in F2-score and over 8% in AUC. The adjusted class-balancing and self-training policies used in SPLINT (CBST-Adjust) also work effectively for selecting pseudo-labels for minority classes from unlabeled trace sets, demonstrating SPLINT's practicability.
Pages: 570-582
Page count: 13