Hellinger distance decision trees for PU learning in imbalanced data sets

Cited by: 7
Authors
Vazquez, Carlos Ortega [1]
vanden Broucke, Seppe [1, 2]
De Weerdt, Jochen [1]
Affiliations
[1] Katholieke Univ Leuven, Fac Econ & Business, Res Ctr Informat Syst Engn, Leuven, Belgium
[2] Univ Ghent, Fac Econ & Business Adm, Dept Business Informat & Operat Management, Ghent, Belgium
Keywords
PU Learning; Weakly supervised learning; Imbalanced classification; Ensemble learning; SUPERVISED AUC OPTIMIZATION; CLASSIFICATION; ALGORITHMS; SMOTE; SVM
DOI
10.1007/s10994-023-06323-y
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Learning from positive and unlabeled data, or PU learning, is the setting in which a binary classifier can be trained only on positive and unlabeled instances, the latter containing both positive and negative instances. Many PU applications, e.g., fraud detection, are also characterized by class imbalance, which creates a challenging setting. Not only are there fewer minority class examples than in the case where all labels are known, but only a small fraction of the unlabeled observations are actually positive. Despite the relevance of the topic, only a few studies have considered a class imbalance setting in PU learning. In this paper, we propose a novel technique that can directly handle imbalanced PU data, named the PU Hellinger Decision Tree (PU-HDT). Our technique exploits the class prior to estimate the counts of positives and negatives in every node in the tree. Moreover, the Hellinger distance is used instead of more conventional splitting criteria because it has been shown to be class-imbalance insensitive. This simple yet effective adaptation allows PU-HDT to perform well in highly imbalanced PU data sets. We also introduce PU Stratified Hellinger Random Forest (PU-SHRF), which uses PU-HDT as its base learner and integrates stratified bootstrap sampling. Our empirical analysis shows that PU-SHRF substantially outperforms state-of-the-art PU learning methods for imbalanced data sets in most experimental settings.
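The two ingredients named in the abstract, node-level count estimation from the class prior and a Hellinger distance splitting criterion, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The count estimator is an assumption: it supposes that labeled positives are a representative sample of all positives, so the share of labeled positives reaching a node approximates the share of all positives reaching it. The function names `estimate_pu_counts` and `hellinger_split_value` are hypothetical. The Hellinger value follows the standard binary-split form used in Hellinger distance decision trees.

```python
import math


def estimate_pu_counts(n_labeled_pos, n_unlabeled,
                       total_labeled_pos, total_unlabeled, class_prior):
    """Estimate (positives, negatives) among the unlabeled instances in a node.

    Assumption (not taken from the paper): the fraction of labeled positives
    that fall in the node approximates the fraction of all positives that
    fall in it, and class_prior is the share of positives in the unlabeled set.
    """
    if total_labeled_pos == 0:
        return 0.0, float(n_unlabeled)
    frac_pos_in_node = n_labeled_pos / total_labeled_pos
    est_pos = class_prior * total_unlabeled * frac_pos_in_node
    # The estimate cannot be negative or exceed the unlabeled count in the node.
    est_pos = min(max(est_pos, 0.0), float(n_unlabeled))
    return est_pos, n_unlabeled - est_pos


def hellinger_split_value(pos_left, neg_left, pos_right, neg_right):
    """Hellinger distance between the class-conditional branch distributions.

    Each class is normalized by its own total, which is why the criterion
    does not depend on the ratio of positives to negatives.
    """
    pos_total = pos_left + pos_right
    neg_total = neg_left + neg_right
    if pos_total == 0 or neg_total == 0:
        return 0.0  # no contrast between classes is measurable
    total = 0.0
    for p, n in ((pos_left, neg_left), (pos_right, neg_right)):
        total += (math.sqrt(p / pos_total) - math.sqrt(n / neg_total)) ** 2
    return math.sqrt(total)
```

In a tree learner, the estimated counts for the left and right child of each candidate split would be passed to `hellinger_split_value`, and the split with the largest value would be chosen. The value ranges from 0 for a split that leaves both classes identically distributed to the square root of 2 for a split that separates them completely.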
Pages: 4547-4578
Page count: 32