An Empirical Study on Data Balancing in Machine Learning Based Software Traceability Methods

Cited by: 1
Authors
Wang, Bangchao [1 ,2 ]
Wang, Zihan [3 ]
Wan, Hongyan [1 ,2 ]
Li, Xingfu [1 ]
Deng, Yang [1 ]
Affiliations
[1] Wuhan Text Univ, Sch Comp Sci & Artificial Intelligence, Wuhan, Peoples R China
[2] Wuhan Text Univ, Engn Res Ctr Hubei Prov Clothing Informat, Wuhan, Peoples R China
[3] Wuhan Text Univ, Sch Math & Phys Sci, Wuhan, Peoples R China
Source
2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN | 2023
Funding
National Natural Science Foundation of China;
Keywords
Machine learning; Data balancing; Software traceability; Software engineering;
DOI
10.1109/IJCNN54540.2023.10191386
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Machine learning (ML) has been widely used in trace link recovery (TLR) to reduce the cost of manually maintaining trace links. However, the imbalanced distribution of valid and invalid links seriously degrades classifier performance. Although a few studies have applied data balancing techniques (DBT) to ML-based TLR, none has systematically analyzed which combinations are most effective. Therefore, we perform an empirical study with three groups of controlled experiments to explore how combining different ML methods with and without DBT affects TLR efficiency. We compare the performance of supervised and unsupervised ML-based TLR, each with and without DBT. Then, we analyze the performance of the ensemble learning model (EM) with DBT on TLR. The experimental results on the 7 imbalanced datasets of CoEST indicate that DBT has a positive effect on ML-based TLR. Specifically, the recall of the Logistic Regression (LR) model increased by 0.5517 after combining with most DBTs on EasyClinic(ID-TC), while Tomek-link significantly improves the precision of K-Nearest Neighbor (KNN), Decision Tree (DT), LR, and Support Vector Machine (SVM). The precision of LR increased from 0.5036 to 1.0. BalanceRF is best at increasing recall, reaching 1.0 on 4 datasets. Moreover, the degree of improvement of ML-based TLR with DBT varies with the size of the dataset and the proportion of valid links.
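The abstract names Tomek-link as one of the data balancing techniques but does not describe how it was implemented. As a minimal sketch of the standard definition only (not the authors' code), a Tomek link is a pair of samples with different labels that are each other's nearest neighbor; under-sampling then drops the majority-class member of each pair. The function names, the toy feature vectors, and the label convention (0 = invalid link, 1 = valid link) below are all assumptions made for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+


def nearest_neighbor(i, X):
    """Index of the sample closest to X[i], excluding i itself."""
    return min((j for j in range(len(X)) if j != i),
               key=lambda j: dist(X[i], X[j]))


def tomek_links(X, y):
    """Pairs (i, j), i < j, that are mutual nearest neighbors with different labels."""
    nn = [nearest_neighbor(i, X) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]


def remove_majority_in_links(X, y, majority_label):
    """Drop the majority-class sample of every Tomek link."""
    drop = {k for pair in tomek_links(X, y) for k in pair
            if y[k] == majority_label}
    keep = [k for k in range(len(X)) if k not in drop]
    return [X[k] for k in keep], [y[k] for k in keep]


# Hypothetical toy data: 0 = invalid link (majority), 1 = valid link (minority).
X = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1), (1.0, 1.0), (1.05, 1.0), (3.0, 3.0)]
y = [0, 0, 0, 1, 0, 1]

X_res, y_res = remove_majority_in_links(X, y, majority_label=0)
```

In this toy set only the invalid sample sitting next to a valid one is removed, which illustrates why cleaning such borderline majority samples can raise precision; the brute-force neighbor search is quadratic and intended for illustration, not for datasets of realistic size.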
Pages: 8