CBReT: A Cluster-Based Resampling Technique for dealing with imbalanced data in code smell prediction

Cited by: 3
Authors
Thakur, Praveen Singh [1]
Jadeja, Mahipal [1]
Chouhan, Satyendra Singh [1]
Affiliation
[1] MNIT, Dept CSE, Jaipur 302017, India
Keywords
Code smell prediction; Imbalance learning; Oversampling; Software maintainability; Empirical study
DOI
10.1016/j.knosys.2024.111390
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Code smells are substandard design patterns in a software system's source code that may lead to fault-prone implementations. Machine learning-based code smell prediction models suffer from data imbalance, i.e., one class contains significantly more instances than the other. Existing oversampling approaches, such as SMOTE (Synthetic Minority Over-sampling Technique), have been used to balance code smell datasets by generating synthetic samples for the minority class. However, the class distributions of code smell datasets overlap; hence, randomly generated instances can damage the decision boundary between the two classes. This paper addresses this issue and proposes a novel Cluster-Based Resampling Technique, CBReT, that generates synthetic instances by taking the distribution of the code smell data into account. CBReT first forms clusters (containing both minority and majority instances) based on the data distribution, using a Gaussian Mixture Model (GMM). Next, each cluster is balanced separately by synthesizing minority instances. While balancing the clusters, CBReT also checks the validity of each synthetic instance so that it has properties similar to those of the other minority instances. To assess the performance of CBReT, extensive experiments were conducted on four publicly available benchmark code smell datasets, using various performance metrics. The experimental results show that CBReT significantly improved the performance of the code smell prediction model, by between 0.18% (minimum) and 9.08% (maximum), compared to state-of-the-art imbalance learning approaches.
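The three steps named in the abstract (cluster with a GMM, balance each cluster separately, validate each synthetic instance) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `cluster_balanced_oversample`, its parameters, the SMOTE-style interpolation between two minority points, and the validity rule (keep a candidate only if the fitted GMM assigns it to the cluster it was generated for) are all assumptions made for illustration. The abstract does not specify how CBReT generates or validates instances.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def cluster_balanced_oversample(X, y, minority_label=1, n_components=2, seed=0):
    """Hypothetical sketch of cluster-wise oversampling (not the paper's code)."""
    rng = np.random.default_rng(seed)

    # Step 1: form clusters from the data distribution with a GMM.
    gmm = GaussianMixture(n_components=n_components, random_state=seed).fit(X)
    clusters = gmm.predict(X)

    new_rows = []
    for c in range(n_components):
        in_cluster = clusters == c
        minority = X[in_cluster & (y == minority_label)]
        majority = X[in_cluster & (y != minority_label)]

        # Step 2: balance this cluster on its own.
        deficit = len(majority) - len(minority)
        if deficit <= 0 or len(minority) < 2:
            continue  # already balanced, or too few points to interpolate

        made, attempts = 0, 0
        while made < deficit and attempts < 20 * deficit:
            attempts += 1
            # Assumed generator: interpolate between two minority points.
            i, j = rng.choice(len(minority), size=2, replace=False)
            candidate = minority[i] + rng.random() * (minority[j] - minority[i])

            # Step 3 (assumed validity rule): the candidate must still be
            # assigned to the cluster it was generated for.
            if gmm.predict(candidate[None, :])[0] == c:
                new_rows.append(candidate)
                made += 1

    if not new_rows:
        return X, y
    X_out = np.vstack([X, np.array(new_rows)])
    y_out = np.concatenate(
        [y, np.full(len(new_rows), minority_label, dtype=y.dtype)]
    )
    return X_out, y_out
```

Calling `cluster_balanced_oversample(X, y)` on a feature matrix `X` and a binary label vector `y` returns the original rows followed by the accepted synthetic minority rows. Balancing inside each cluster, rather than over the whole dataset, is what keeps new points close to existing minority instances in regions where the two classes overlap.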
Pages: 19