An Empirical Study on Data Balancing in Machine Learning Based Software Traceability Methods

Cited by: 1

Authors:

Wang, Bangchao [1,2]

Wang, Zihan [3]

Wan, Hongyan [1,2]

Li, Xingfu [1]

Deng, Yang [1]

Affiliations:

[1] Wuhan Text Univ, Sch Comp Sci & Artificial Intelligence, Wuhan, Peoples R China

[2] Wuhan Text Univ, Engn Res Ctr Hubei Prov Clothing Informat, Wuhan, Peoples R China

[3] Wuhan Text Univ, Sch Math & Phys Sci, Wuhan, Peoples R China

Source:

2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN | 2023

Funding:

National Natural Science Foundation of China;

Keywords:

Machine learning; Data balancing; Software traceability; Software engineering;

DOI:

10.1109/IJCNN54540.2023.10191386

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Machine learning (ML) has been widely used in trace link recovery (TLR) to reduce the cost of developers manually maintaining trace links. However, the imbalanced distribution of valid and invalid links seriously degrades classifier performance. Although a few studies have applied data balancing techniques (DBT) to ML-based TLR, none has systematically analyzed which combinations of ML methods and DBTs are most effective. Therefore, we perform an empirical study with three groups of controlled experiments to explore how combining different ML methods with and without DBT affects TLR efficiency. We compare the performance of supervised and unsupervised ML-based TLR, each with and without DBT. Then, we analyze the performance of the ensemble learning model (EM) with DBT on TLR. The experimental results on the 7 imbalanced datasets of CoEST indicate that DBT has a positive effect on ML-based TLR. Specifically, the recall of the LR model increased by 0.5517 after combining with most DBTs on EasyClinic(ID-TC), while Tomek-link significantly improves the precision of K-Nearest Neighbor (KNN), Decision Tree (DT), LR, and Support Vector Machine (SVM). The precision of LR increased from 0.5036 to 1.0. BalanceRF is best at increasing recall, reaching 1.0 on 4 datasets. Moreover, the degree of improvement of ML-based TLR with DBT varies with the size of the dataset and the proportion of valid links.
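To make the data balancing idea in the abstract concrete, the following is a minimal pure-Python sketch of Tomek-link undersampling, one of the DBTs named above, together with the precision and recall metrics the abstract reports. It is an illustration only, not the authors' implementation: the function names (`remove_tomek_links`, `precision_recall`), the toy 2-D feature vectors, and the label convention (1 = valid link, 0 = invalid link) are all assumptions made for this example.

```python
import math


def nearest_neighbor(i, points):
    """Return the index of the point closest to points[i], excluding i itself."""
    best, best_d = None, math.inf
    for j, q in enumerate(points):
        if j == i:
            continue
        d = math.dist(points[i], q)
        if d < best_d:
            best, best_d = j, d
    return best


def remove_tomek_links(points, labels, majority_label=0):
    """Drop majority-class samples that take part in a Tomek link.

    A Tomek link is a pair of samples from different classes that are
    each other's nearest neighbor. Binary labels are assumed.
    """
    nn = [nearest_neighbor(i, points) for i in range(len(points))]
    drop = set()
    for i, j in enumerate(nn):
        if nn[j] == i and labels[i] != labels[j]:
            # Mutual nearest neighbors with different labels: remove
            # the member that belongs to the majority class.
            drop.add(i if labels[i] == majority_label else j)
    keep = [k for k in range(len(points)) if k not in drop]
    return [points[k] for k in keep], [labels[k] for k in keep]


def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the positive (valid-link) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical toy data: each point stands for one candidate trace link.
points = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1), (1.0, 1.0),
          (1.05, 1.0), (2.0, 2.0), (0.0, 1.0)]
labels = [0, 0, 0, 1, 0, 1, 0]  # 1 = valid link (minority class)

balanced_points, balanced_labels = remove_tomek_links(points, labels)
precision, recall = precision_recall([1, 0, 1, 1], [1, 1, 0, 1])
```

In this toy set the invalid-link sample at (1.05, 1.0) and the valid-link sample at (1.0, 1.0) are mutual nearest neighbors, so only the invalid-link sample is removed. Cleaning such borderline majority samples is one plausible reason an undersampling method of this kind could raise precision, although the abstract itself does not state the mechanism.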


Pages: 8

