Hellinger distance decision trees for PU learning in imbalanced data sets

Cited by: 7
Authors
Vazquez, Carlos Ortega [1]
vanden Broucke, Seppe [1, 2]
De Weerdt, Jochen [1]
Affiliations
[1] Katholieke Univ Leuven, Fac Econ & Business, Res Ctr Informat Syst Engn, Leuven, Belgium
[2] Univ Ghent, Fac Econ & Business Adm, Dept Business Informat & Operat Management, Ghent, Belgium
Keywords
PU Learning; Weakly supervised learning; Imbalanced classification; Ensemble learning; SUPERVISED AUC OPTIMIZATION; CLASSIFICATION; ALGORITHMS; SMOTE; SVM
DOI
10.1007/s10994-023-06323-y
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Learning from positive and unlabeled data, or PU learning, is the setting in which a binary classifier can only be trained on positive and unlabeled instances, the latter containing both positive and negative instances. Many PU applications, e.g., fraud detection, are also characterized by class imbalance, which creates a challenging setting. Not only are there fewer labeled minority class examples than in the case where all labels are known, but only a small fraction of the unlabeled observations are actually positive. Despite the relevance of the topic, only a few studies have considered class imbalance in PU learning. In this paper, we propose a novel technique that can directly handle imbalanced PU data, named the PU Hellinger Decision Tree (PU-HDT). Our technique exploits the class prior to estimate the counts of positives and negatives in every node of the tree. Moreover, the Hellinger distance is used instead of more conventional splitting criteria because it has been shown to be insensitive to class imbalance. This simple yet effective adaptation allows PU-HDT to perform well on highly imbalanced PU data sets. We also introduce the PU Stratified Hellinger Random Forest (PU-SHRF), which uses PU-HDT as its base learner and integrates stratified bootstrap sampling. Our empirical analysis shows that PU-SHRF substantially outperforms state-of-the-art PU learning methods for imbalanced data sets in most experimental settings.
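The two ideas named in the abstract (class-prior-based count estimation in a node, and a Hellinger distance splitting criterion) can be illustrated with a minimal Python sketch. The abstract does not give the paper's formulas, so everything here is an assumption: the function names `estimate_counts` and `hellinger_split` are hypothetical, the count estimate assumes labeled positives are selected completely at random, and the split score is the standard two-branch Hellinger criterion from Hellinger distance decision trees, not necessarily the exact PU-HDT variant.

```python
import math


def estimate_counts(n_lab_pos, n_unl, total_lab_pos, total_unl, prior):
    """Estimate (positives, negatives) hidden among a node's unlabeled instances.

    Assumption (not from the source): labeled positives are a random sample of
    all positives, so the share of labeled positives that falls in the node
    approximates the share of unlabeled positives that falls in it.
    """
    if total_lab_pos == 0:
        return 0.0, float(n_unl)
    est_pos = prior * total_unl * n_lab_pos / total_lab_pos
    est_pos = min(est_pos, float(n_unl))  # cannot exceed the node's unlabeled count
    return est_pos, n_unl - est_pos


def hellinger_split(left, right):
    """Hellinger distance of a binary split; left/right are (pos, neg) counts."""
    total_pos = left[0] + right[0]
    total_neg = left[1] + right[1]
    if total_pos == 0 or total_neg == 0:
        return 0.0  # no class contrast to measure
    return math.sqrt(sum(
        (math.sqrt(p / total_pos) - math.sqrt(n / total_neg)) ** 2
        for p, n in (left, right)
    ))


# Toy data: 10 labeled positives, 100 unlabeled instances, class prior 0.1.
left = estimate_counts(n_lab_pos=8, n_unl=50, total_lab_pos=10, total_unl=100, prior=0.1)
right = estimate_counts(n_lab_pos=2, n_unl=50, total_lab_pos=10, total_unl=100, prior=0.1)
score = hellinger_split(left, right)
```

Because the score compares per-class proportions rather than raw counts, it does not shrink merely because positives are rare, which is the imbalance insensitivity the abstract refers to. A split that separates the classes perfectly scores the square root of 2, and a split that leaves both class distributions unchanged scores 0.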
Pages: 4547-4578
Number of pages: 32