CBReT: A Cluster-Based Resampling Technique for dealing with imbalanced data in code smell prediction

Cited by: 3
Authors
Thakur, Praveen Singh [1]
Jadeja, Mahipal [1]
Chouhan, Satyendra Singh [1]
Affiliations
[1] MNIT, Dept CSE, Jaipur 302017, India
Keywords
Code smell prediction; Imbalance learning; Oversampling; Software maintainability; Empirical study
DOI
10.1016/j.knosys.2024.111390
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Code smells are substandard design patterns in source code that may lead to fault-prone implementations. Machine learning-based code smell prediction models suffer from data imbalance, i.e., one class contains significantly more instances than the other. Existing oversampling approaches, such as SMOTE (Synthetic Minority Over-sampling Technique), have been used to balance code smell datasets by generating synthetic samples for the minority class. However, the class distributions in code smell datasets overlap; hence, randomly generated instances can damage the decision boundary between the two classes. This paper addresses this issue and proposes a novel Cluster-Based Resampling Technique, CBReT, that generates synthetic instances by considering the distribution of the code smell data. CBReT first forms clusters (containing minority and majority instances) based on the data distribution using a Gaussian Mixture Model (GMM). Next, each cluster is balanced separately by synthesizing minority instances. While balancing the clusters, CBReT also checks the validity of the synthetic instances so that each synthetic instance has properties similar to those of the other minority instances. To assess the performance of CBReT, extensive experiments were conducted on four publicly available benchmark code smell datasets, using various performance metrics. The experimental results show that CBReT significantly improved the performance of the code smell prediction model, by 0.18% (min) to 9.08% (max), compared to state-of-the-art imbalance learning approaches.
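The per-cluster balancing step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes cluster assignments have already been produced elsewhere (for example by a fitted GMM), it generates candidates by SMOTE-style interpolation between two minority instances of the same cluster, and its validity check (a candidate is kept only if its nearest original neighbor in the cluster is a minority instance) is a hypothetical stand-in, since the abstract does not specify the actual rule. The names `balance_clusters` and `is_valid` are illustrative.

```python
import math
import random
from collections import defaultdict


def is_valid(cand, minority_pts, majority_pts):
    """Hypothetical validity rule (an assumption, not taken from the paper):
    accept a candidate only if it lies at least as close to some minority
    instance as to any majority instance of the same cluster."""
    d_min = min(math.dist(cand, p) for p in minority_pts)
    d_maj = min(math.dist(cand, p) for p in majority_pts)
    return d_min <= d_maj


def balance_clusters(X, y, clusters, minority=1, seed=0):
    """Balance each cluster separately by synthesizing minority instances.

    X        : list of feature vectors (lists of floats)
    y        : class label per instance
    clusters : cluster id per instance (assumed to come from a GMM fitted elsewhere)
    Returns new lists (X_balanced, y_balanced); the inputs are not modified.
    """
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for i, c in enumerate(clusters):
        by_cluster[c].append(i)

    new_X, new_y = [], []
    for idx in by_cluster.values():
        mino = [X[i] for i in idx if y[i] == minority]
        majo = [X[i] for i in idx if y[i] != minority]
        deficit = len(majo) - len(mino)
        if len(mino) < 2 or deficit <= 0:
            continue  # cannot interpolate, or cluster already balanced
        made, attempts = 0, 0
        # Cap the attempts so heavily overlapped clusters cannot loop forever.
        while made < deficit and attempts < 50 * deficit:
            attempts += 1
            a, b = rng.sample(mino, 2)
            t = rng.random()
            # Point on the segment between two minority instances.
            cand = [ai + t * (bi - ai) for ai, bi in zip(a, b)]
            if is_valid(cand, mino, majo):
                new_X.append(cand)
                new_y.append(minority)
                made += 1
    return X + new_X, y + new_y
```

Calling `balance_clusters(X, y, clusters)` returns the original instances followed by the accepted synthetic ones. Clusters with fewer than two minority instances are skipped in this sketch, because interpolation needs a pair of points.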
Pages: 19