Shape Penalized Decision Forests for Imbalanced Data Classification

Cited by: 0
Authors
Goswami, Rahul [1,2]
Garai, Aindrila [3]
Sadhukhan, Payel [4]
Ghosh, Palash [1]
Chakraborty, Tanujit [2,5]
Affiliations
[1] Indian Inst Technol, Dept Math, Gauhati 781039, India
[2] Sorbonne Univ Abu Dhabi, SAFIR, Abu Dhabi, U Arab Emirates
[3] Univ Bristol, Sch Math, Bristol BS8 1TR, England
[4] Techno Main, Dept Comp Sci & Engn IoT, Kolkata 700091, India
[5] Sorbonne Univ, Sorbonne Ctr Artificial Intelligence, F-75005 Paris, France
Source
IEEE ACCESS | 2025, Vol. 13
Keywords
Forests; Shape; Training; Classification algorithms; Machine learning algorithms; Ensemble learning; Boosting; Robustness; Computational modeling; Adaptation models; Imbalanced data; ensemble learning; surface-to-volume ratio; tabular data; computational complexity; MACHINE; SMOTE;
DOI
10.1109/ACCESS.2025.3569523
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Discipline Code
0812
Abstract
Class imbalance poses a critical challenge in binary classification problems, particularly when rare but significant events are underrepresented in the training set. While traditional machine learning models and modern deep learning techniques struggle with such imbalances, decision trees and random forests combined with data sampling strategies have shown effectiveness, especially for tabular datasets. However, undersampling and oversampling approaches often introduce complexity and risk the loss of valuable information. This paper introduces Shape Penalized Decision Forests, a novel classifier tailored for imbalanced binary classification. Our method integrates a penalty on the surface-to-volume ratio of decision sets within decision tree construction, thereby inherently addressing class imbalance without additional sampling. The proposed approach enhances predictive performance and generalization by leveraging ensemble learning strategies such as bagging and adaptive boosting. We evaluate the method on twenty benchmark tabular imbalanced datasets, spanning diverse sample sizes and imbalance ratios, and demonstrate its superiority over several state-of-the-art data-level and algorithmic-level methods. Furthermore, simulated datasets with visually interpretable structures showcase the model's generalization capacity. Statistical significance tests validate the robustness of our approach. Finally, we provide a Python package, 'imbalanced-spdf', offering an accessible implementation for practitioners and researchers.
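The penalty described in the abstract targets the surface-to-volume ratio of the axis-aligned regions a decision tree carves out of feature space. As an illustration only (not the authors' implementation from the 'imbalanced-spdf' package), the ratio for a d-dimensional hyperrectangle with side lengths s_i simplifies to 2·Σ_i 1/s_i, so thin, elongated decision regions are penalized more than compact ones:

```python
import numpy as np

def surface_to_volume_ratio(lows, highs):
    """Surface-to-volume ratio of an axis-aligned hyperrectangle.

    For side lengths s_i, volume = prod(s_i) and surface area
    = 2 * sum_i prod_{j != i} s_j, so the ratio reduces to
    2 * sum_i (1 / s_i). Thin boxes (small s_i) score high.
    """
    sides = np.asarray(highs, dtype=float) - np.asarray(lows, dtype=float)
    return 2.0 * np.sum(1.0 / sides)

# A unit square: perimeter 4, area 1 -> ratio 4.
print(surface_to_volume_ratio([0.0, 0.0], [1.0, 1.0]))
# A 2x1 rectangle: perimeter 6, area 2 -> ratio 3 (more compact per unit volume).
print(surface_to_volume_ratio([0.0, 0.0], [2.0, 1.0]))
```

In a shape-penalized tree, a term of this form would be added to the split criterion so that candidate splits producing needle-like regions around minority-class points are discouraged without resampling the data.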
Pages: 86380-86395
Page count: 16