Hellinger distance decision trees for PU learning in imbalanced data sets

Cited by: 7
Authors
Vazquez, Carlos Ortega [1]
vanden Broucke, Seppe [1,2]
De Weerdt, Jochen [1]
Affiliations
[1] Katholieke Univ Leuven, Fac Econ & Business, Res Ctr Informat Syst Engn, Leuven, Belgium
[2] Univ Ghent, Fac Econ & Business Adm, Dept Business Informat & Operat Management, Ghent, Belgium
Keywords
PU learning; Weakly supervised learning; Imbalanced classification; Ensemble learning; Supervised AUC optimization; Classification; Algorithms; SMOTE; SVM
DOI
10.1007/s10994-023-06323-y
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Learning from positive and unlabeled data, or PU learning, is the setting in which a binary classifier can only train on positive and unlabeled instances, the latter containing both positive and negative instances. Many PU applications, e.g., fraud detection, are also characterized by class imbalance, which creates a challenging setting: not only are there fewer labeled minority-class examples than in the fully supervised case, but also only a small fraction of the unlabeled observations is actually positive. Despite the relevance of the topic, only a few studies have considered a class-imbalance setting in PU learning. In this paper, we propose a novel technique that can directly handle imbalanced PU data, named the PU Hellinger Decision Tree (PU-HDT). Our technique exploits the class prior to estimate the counts of positives and negatives in every node of the tree. Moreover, the Hellinger distance is used instead of more conventional splitting criteria because it has been shown to be insensitive to class imbalance. This simple yet effective adaptation allows PU-HDT to perform well on highly imbalanced PU data sets. We also introduce the PU Stratified Hellinger Random Forest (PU-SHRF), which uses PU-HDT as its base learner and integrates stratified bootstrap sampling. Our empirical analysis shows that PU-SHRF substantially outperforms state-of-the-art PU learning methods for imbalanced data sets in most experimental settings.
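
To make the abstract's two ideas concrete, the following minimal Python sketch computes a Hellinger-distance split score on class-prior-adjusted counts. It assumes the SCAR (selected completely at random) labeling mechanism, under which each positive instance is labeled with a constant label frequency c, so a node's true positive count can be estimated from its labeled positives; the function names, the n_labeled_pos / c estimator, and the example numbers are illustrative assumptions, not the paper's actual implementation.

import numpy as np

def estimate_node_counts(n_labeled_pos, n_total, label_frequency):
    # Under SCAR, each positive is labeled with constant probability c
    # (the label frequency), so the true positive count in a node can be
    # estimated as n_labeled_pos / c, capped at the node size.
    n_pos = min(n_labeled_pos / label_frequency, n_total)
    return n_pos, n_total - n_pos

def hellinger_split_score(left, right, label_frequency):
    # left and right are (n_labeled_pos, n_total) pairs for the two
    # child nodes produced by a candidate split.
    lp, ln = estimate_node_counts(*left, label_frequency)
    rp, rn = estimate_node_counts(*right, label_frequency)
    tp, tn = lp + rp, ln + rn  # parent-node totals
    if tp == 0 or tn == 0:
        return 0.0
    # Squared Hellinger distance between the split distributions
    # conditional on the positive and on the negative class.
    d2 = (np.sqrt(lp / tp) - np.sqrt(ln / tn)) ** 2 \
         + (np.sqrt(rp / tp) - np.sqrt(rn / tn)) ** 2
    return float(np.sqrt(d2))

# Example: class prior pi = 0.05 and 2% of all instances carry a label,
# giving label frequency c = 0.02 / 0.05 = 0.4.
print(hellinger_split_score(left=(8, 40), right=(2, 160), label_frequency=0.4))

In the full method, such a score would be maximized over candidate splits when growing a tree, and PU-SHRF would grow an ensemble of such trees on stratified bootstrap samples, presumably so that every sample retains its share of labeled positives.
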
Pages: 4547-4578
Page count: 32