Hellinger distance decision trees for PU learning in imbalanced data sets

Cited by: 7
Authors
Vazquez, Carlos Ortega [1]
vanden Broucke, Seppe [1, 2]
De Weerdt, Jochen [1]
Affiliations
[1] Katholieke Univ Leuven, Fac Econ & Business, Res Ctr Informat Syst Engn, Leuven, Belgium
[2] Univ Ghent, Fac Econ & Business Adm, Dept Business Informat & Operat Management, Ghent, Belgium
Keywords
PU Learning; Weakly supervised learning; Imbalanced classification; Ensemble learning; SUPERVISED AUC OPTIMIZATION; CLASSIFICATION; ALGORITHMS; SMOTE; SVM
DOI
10.1007/s10994-023-06323-y
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Learning from positive and unlabeled data, or PU learning, is the setting in which a binary classifier can only be trained on positive and unlabeled instances, where the unlabeled set contains both positive and negative instances. Many PU applications, e.g., fraud detection, are also characterized by class imbalance, which creates a challenging setting. Not only are there fewer labeled minority class examples than in the case where all labels are known, but only a small fraction of the unlabeled observations are actually positive. Despite the relevance of the topic, only a few studies have considered class imbalance in PU learning. In this paper, we propose a novel technique that can directly handle imbalanced PU data, named the PU Hellinger Decision Tree (PU-HDT). Our technique exploits the class prior to estimate the counts of positives and negatives in every node of the tree. Moreover, the Hellinger distance is used instead of more conventional splitting criteria because it has been shown to be insensitive to class imbalance. This simple yet effective adaptation allows PU-HDT to perform well on highly imbalanced PU data sets. We also introduce the PU Stratified Hellinger Random Forest (PU-SHRF), which uses PU-HDT as its base learner and integrates stratified bootstrap sampling. Our empirical analysis shows that PU-SHRF substantially outperforms state-of-the-art PU learning methods for imbalanced data sets in most experimental settings.
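The abstract names two ingredients of PU-HDT: class-prior-based estimates of the positive and negative counts in each node, and the Hellinger distance as the splitting criterion. The Python sketch below illustrates how these could fit together. It is not the authors' implementation. The counting rule is an assumption made for illustration: labeled positives are taken to be a random sample of all positives, so hidden positives are spread over nodes in proportion to the labeled positives. The function names and the parameter `prior_unlabeled` (the assumed fraction of positives in the unlabeled set) are hypothetical. The split score follows the standard Hellinger distance formula for a binary split used in Hellinger distance decision trees.

```python
import math


def estimate_counts(n_labeled, n_unlabeled, total_labeled, total_unlabeled,
                    prior_unlabeled):
    """Estimate (positives, negatives) in one tree node from PU data.

    ASSUMPTION (not taken from the paper): hidden positives in the unlabeled
    set are distributed over nodes in the same proportion as the labeled
    positives, and prior_unlabeled is the fraction of positives among all
    unlabeled instances.
    """
    hidden_total = prior_unlabeled * total_unlabeled
    # A node cannot hide more positives than it has unlabeled instances.
    hidden_here = min(n_unlabeled, hidden_total * n_labeled / total_labeled)
    positives = n_labeled + hidden_here
    negatives = n_unlabeled - hidden_here
    return positives, negatives


def hellinger_split(pos_left, neg_left, pos_right, neg_right):
    """Hellinger distance between the class-conditional distributions
    of positives and negatives over the two branches of a binary split."""
    pos_total = pos_left + pos_right
    neg_total = neg_left + neg_right
    if pos_total == 0 or neg_total == 0:
        return 0.0  # undefined without both classes; treat as no separation
    return math.sqrt(
        (math.sqrt(pos_left / pos_total) - math.sqrt(neg_left / neg_total)) ** 2
        + (math.sqrt(pos_right / pos_total) - math.sqrt(neg_right / neg_total)) ** 2
    )


# Usage: 10 labeled positives, 100 unlabeled instances, assumed prior of 0.1.
left = estimate_counts(8, 20, 10, 100, 0.1)
right = estimate_counts(2, 80, 10, 100, 0.1)
score = hellinger_split(left[0], left[1], right[0], right[1])
```

Because the score compares the fraction of each class sent to each branch, rather than the class proportions within a branch, it does not depend on how rare the positive class is overall. A candidate split with a higher score separates the estimated positives from the estimated negatives more cleanly, with a maximum of sqrt(2) for a perfect split.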
Pages: 4547-4578
Number of pages: 32
Related papers
50 in total
[21]   Local ensemble learning from imbalanced and noisy data for word sense disambiguation [J].
Krawczyk, Bartosz ;
McInnes, Bridget T. .
PATTERN RECOGNITION, 2018, 78 :103-119
[22]   Annealing Genetic GAN for Imbalanced Web Data Learning [J].
Hao, Jingyu ;
Wang, Chengjia ;
Yang, Guang ;
Gao, Zhifan ;
Zhang, Jinglin ;
Zhang, Heye .
IEEE TRANSACTIONS ON MULTIMEDIA, 2022, 24 :1164-1174
[23]   Manifold regularized multiple kernel learning with Hellinger distance [J].
Yang, Tao ;
Fu, Dongmei ;
Li, Xiaogang ;
Riha, Kamil .
CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS, 2019, 22 (Suppl 6) :13843-13851
[24]   A comparative study on noise filtering of imbalanced data sets [J].
Szeghalmy, Szilvia ;
Fazekas, Attila .
KNOWLEDGE-BASED SYSTEMS, 2024, 301
[25]   FUZZY AND SMOTE RESAMPLING TECHNIQUE FOR IMBALANCED DATA SETS [J].
Zorkeflee, Maisarah ;
Din, Aniza Mohamed ;
Ku-Mahamud, Ku Ruhana .
PROCEEDINGS OF THE 5TH INTERNATIONAL CONFERENCE ON COMPUTING & INFORMATICS, 2015, :638-643
[26]   Distance Metric Learning with Prototype Selection for Imbalanced Classification [J].
Luis Suarez, Juan ;
Garcia, Salvador ;
Herrera, Francisco .
HYBRID ARTIFICIAL INTELLIGENT SYSTEMS, HAIS 2021, 2021, 12886 :391-402
[27]   RETRACTED: The Use of Hellinger Distance Undersampling Model to Improve the Classification of Disease Class in Imbalanced Medical Datasets (Retracted Article) [J].
Al-Shamaa, Zina Z. R. ;
Kurnaz, Sefer ;
Duru, Adil Deniz ;
Peppa, Nadia ;
Mirnezami, Alex H. ;
Hamady, Zaed Z. R. .
APPLIED BIONICS AND BIOMECHANICS, 2020, 2020
[28]   Mining and Integrating Reliable Decision Rules for Imbalanced Cancer Gene Expression Data Sets [J].
Yu, Hualong ;
Ni, Jun ;
Dan, Yuanyuan ;
Xu, Sen .
Tsinghua Science and Technology, 2012, 17 (06) :666-673
[29]   Mining and integrating reliable decision rules for imbalanced cancer gene expression data sets [J].
Yu, Hualong ;
Ni, Jun ;
Dan, Yuanyuan ;
Xu, Sen .
Tsinghua Science and Technology, 2012, 17 (06) :666-673
[30]   Box Drawings for Learning with Imbalanced Data [J].
Goh, Siong Thye ;
Rudin, Cynthia .
PROCEEDINGS OF THE 20TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING (KDD'14), 2014, :333-342