CBReT: A Cluster-Based Resampling Technique for dealing with imbalanced data in code smell prediction

Cited by: 3
Authors
Thakur, Praveen Singh [1]
Jadeja, Mahipal [1]
Chouhan, Satyendra Singh [1]
Affiliations
[1] MNIT, Dept CSE, Jaipur 302017, India
Keywords
Code smell prediction; Imbalance learning; Oversampling; Software maintainability; Empirical study;
DOI
10.1016/j.knosys.2024.111390
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Code smells are substandard design patterns in software source code that may lead to fault-prone implementations. Machine learning-based code smell prediction models suffer from data imbalance, i.e., one class contains significantly more instances than the other. Existing oversampling approaches, such as SMOTE (Synthetic Minority Over-sampling Technique), have been used to balance code smell datasets by generating synthetic samples for the minority class. However, the class distributions in code smell datasets overlap; hence, randomly generated instances can damage the decision boundary between the two classes. This paper addresses this issue and proposes a novel Cluster-Based Resampling Technique, CBReT, that generates synthetic instances by taking the distribution of the code smell data into account. CBReT first forms clusters (containing both minority and majority instances) based on the data distribution using a Gaussian Mixture Model (GMM). Next, each cluster is balanced separately by synthesizing minority instances. While balancing the clusters, CBReT also checks the validity of each synthetic instance so that it has properties similar to those of the other minority instances. To assess the performance of CBReT, extensive experiments were conducted on four publicly available benchmark code smell datasets, using various performance metrics. The experimental results show that CBReT improved the performance of the code smell prediction model by between 0.18% (minimum) and 9.08% (maximum) compared to state-of-the-art imbalance learning approaches.
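The three steps named in the abstract (cluster with a GMM, balance each cluster, validate each synthetic instance) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify how instances are synthesized or validated, so the SMOTE-style interpolation, the nearest-neighbour validity rule, and the names `cluster_balanced_oversample`, `n_clusters`, and `minority` are all assumptions made for illustration. It relies on NumPy and scikit-learn's `GaussianMixture`.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def cluster_balanced_oversample(X, y, n_clusters=2, minority=1, seed=0):
    """Hypothetical sketch of cluster-wise oversampling; not the paper's code."""
    rng = np.random.default_rng(seed)
    # Step 1: cluster all instances (both classes) with a Gaussian Mixture Model.
    gmm = GaussianMixture(n_components=n_clusters, random_state=seed)
    labels = gmm.fit_predict(X)

    new_rows = []
    for c in range(n_clusters):
        in_cluster = labels == c
        X_min = X[in_cluster & (y == minority)]
        X_maj = X[in_cluster & (y != minority)]
        deficit = len(X_maj) - len(X_min)
        # Skip clusters that are already balanced, and clusters with fewer
        # than two minority points (interpolation needs a pair).
        if deficit <= 0 or len(X_min) < 2:
            continue

        made, attempts = 0, 0
        while made < deficit and attempts < 20 * deficit:
            attempts += 1
            # Step 2 (assumed): SMOTE-style interpolation between two
            # minority points drawn from the same cluster.
            i, j = rng.choice(len(X_min), size=2, replace=False)
            candidate = X_min[i] + rng.random() * (X_min[j] - X_min[i])
            # Step 3 (assumed validity rule): keep the candidate only if its
            # nearest minority point is at least as close as its nearest
            # majority point.
            d_min = np.linalg.norm(X_min - candidate, axis=1).min()
            d_maj = np.linalg.norm(X_maj - candidate, axis=1).min()
            if d_min <= d_maj:
                new_rows.append(candidate)
                made += 1

    if not new_rows:
        return X, y
    X_new = np.vstack([X, np.array(new_rows)])
    y_new = np.concatenate([y, np.full(len(new_rows), minority, dtype=y.dtype)])
    return X_new, y_new


# Toy usage on synthetic, overlapping two-class data (not a code smell dataset).
demo_rng = np.random.default_rng(42)
X = np.vstack([demo_rng.normal(0, 1, (90, 2)), demo_rng.normal(1, 1, (10, 2))])
y = np.array([0] * 90 + [1] * 10)
X_bal, y_bal = cluster_balanced_oversample(X, y)
```

The attempt cap keeps the loop finite when the classes overlap so heavily that few candidates pass the validity rule, in which case a cluster may remain partly imbalanced. Original instances are never modified or removed; synthetic minority rows are only appended.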
Pages: 19
Related papers
50 in total
  • [21] Cluster-Based Minority Over-Sampling for Imbalanced Datasets
    Puntumapon, Kamthorn
    Rakthamamon, Thanawin
    Waiyamai, Kitsana
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2016, E99D (12): : 3101 - 3109
  • [22] AN ENERGY EFFICIENT HYBRID CODE COMBINING TECHNIQUE FOR CLUSTER-BASED COOPERATIVE WIRELESS NETWORKS
    Sundari, P. Gnana
    Rani, K. Sheela Sobana
    Nagarajan, N.
    FOURTH INTERNATIONAL CONFERENCE ON COMPUTER AND ELECTRICAL ENGINEERING (ICCEE 2011), 2011, : 165 - 172
  • [23] A cluster-based SMOTE both-sampling (CSBBoost) ensemble algorithm for classifying imbalanced data
    Amir Reza Salehi
    Majid Khedmati
    Scientific Reports, 14
  • [24] A cluster-based SMOTE both-sampling (CSBBoost) ensemble algorithm for classifying imbalanced data
    Salehi, Amir Reza
    Khedmati, Majid
    SCIENTIFIC REPORTS, 2024, 14 (01)
  • [25] A cluster-based oversampling algorithm combining SMOTE and k-means for imbalanced medical data
    Xu, Zhaozhao
    Shen, Derong
    Nie, Tiezheng
    Kou, Yue
    Yin, Nan
    Han, Xi
    INFORMATION SCIENCES, 2021, 572 : 574 - 589
  • [26] A New Big Data Model Using Distributed Cluster-Based Resampling for Class-Imbalance Problem
    Terzi, Duygu Sinanc
    Sagiroglu, Seref
    APPLIED COMPUTER SYSTEMS, 2019, 24 (02) : 104 - 110
  • [27] Data Aggregation with Spatially Correlated Grouping Technique on Cluster-based WSNs
    Cho, Chuan-Yu
    Lin, Chun-Lung
    Hsiao, Yu-Hung
    Wang, Jia-Shung
    Yang, Kai-Chao
    2010 FOURTH INTERNATIONAL CONFERENCE ON SENSOR TECHNOLOGIES AND APPLICATIONS (SENSORCOMM), 2008, : 584 - 589
  • [28] Imbalanced Data Classification Based on a Hybrid Resampling SVM Method
    Cao, Lu
    Zhai, Yikui
    IEEE 12TH INT CONF UBIQUITOUS INTELLIGENCE & COMP/IEEE 12TH INT CONF ADV & TRUSTED COMP/IEEE 15TH INT CONF SCALABLE COMP & COMMUN/IEEE INT CONF CLOUD & BIG DATA COMP/IEEE INT CONF INTERNET PEOPLE AND ASSOCIATED SYMPOSIA/WORKSHOPS, 2015, : 1533 - 1536
  • [29] Modeling and analysis of data prediction technique based on Linear Regression Model (DP-LRM) for cluster-based sensor networks
    Agarwal A.
    Jain K.
    Dev A.
    International Journal of Ambient Computing and Intelligence, 2021, 12 (04) : 98 - 117
  • [30] CUSBoost: Cluster-based Under-sampling with Boosting for Imbalanced Classification
    Rayhan, Farshid
    Ahmed, Sajid
    Mahbub, Asif
    Jani, Md. Rafsan
    Shatabda, Swakkhar
    Farid, Dewan Md.
    2017 2ND INTERNATIONAL CONFERENCE ON COMPUTATIONAL SYSTEMS AND INFORMATION TECHNOLOGY FOR SUSTAINABLE SOLUTION (CSITSS-2017), 2017, : 70 - 75