Hellinger distance decision trees for PU learning in imbalanced data sets

Cited by: 5
Authors
Vazquez, Carlos Ortega [1]
vanden Broucke, Seppe [1, 2]
De Weerdt, Jochen [1]
Affiliations
[1] Katholieke Univ Leuven, Fac Econ & Business, Res Ctr Informat Syst Engn, Leuven, Belgium
[2] Univ Ghent, Fac Econ & Business Adm, Dept Business Informat & Operat Management, Ghent, Belgium
Keywords
PU Learning; Weakly supervised learning; Imbalanced classification; Ensemble learning; SUPERVISED AUC OPTIMIZATION; CLASSIFICATION; ALGORITHMS; SMOTE; SVM;
DOI
10.1007/s10994-023-06323-y
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Learning from positive and unlabeled data, or PU learning, is the setting in which a binary classifier can be trained only on positive and unlabeled instances, the latter containing both positive and negative instances. Many PU applications, e.g., fraud detection, are also characterized by class imbalance, which creates a challenging setting. Not only are there fewer minority class examples than in the case where all labels are known, but only a small fraction of the unlabeled observations are actually positive. Despite the relevance of the topic, only a few studies have considered a class imbalance setting in PU learning. In this paper, we propose a novel technique that can directly handle imbalanced PU data, named the PU Hellinger Decision Tree (PU-HDT). Our technique exploits the class prior in order to estimate the counts of positives and negatives in every node of the tree. Moreover, the Hellinger distance is used instead of more conventional splitting criteria because it has been shown to be insensitive to class imbalance. This simple yet effective adaptation allows PU-HDT to perform well on highly imbalanced PU data sets. We also introduce the PU Stratified Hellinger Random Forest (PU-SHRF), which uses PU-HDT as its base learner and integrates stratified bootstrap sampling. Our empirical analysis shows that PU-SHRF substantially outperforms state-of-the-art PU learning methods for imbalanced data sets in most experimental settings.
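The two ingredients named in the abstract, class-prior-based count estimation and a Hellinger distance splitting criterion, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the estimator, so the sketch assumes that labeled positives are a random sample of all positives (the common "selected completely at random" assumption) and uses the standard two-branch Hellinger criterion from the Hellinger distance decision tree literature. The function names `estimate_node_counts` and `hellinger_split_value` are hypothetical.

```python
import math


def estimate_node_counts(lab_pos_node, unl_node, lab_pos_total, unl_total, prior):
    """Estimate (positives, negatives) in a tree node from PU data.

    Assumption (not taken from the source): labeled positives are a random
    sample of all positives, so the share of labeled positives falling in
    this node approximates the share of hidden positives falling in it.
    `prior` is the assumed fraction of positives among unlabeled instances.
    """
    if lab_pos_total == 0:
        return float(lab_pos_node), float(unl_node)
    hidden_pos = prior * unl_total * (lab_pos_node / lab_pos_total)
    # The estimate cannot exceed the unlabeled instances actually in the node.
    hidden_pos = min(hidden_pos, unl_node)
    return lab_pos_node + hidden_pos, unl_node - hidden_pos


def hellinger_split_value(pos_left, neg_left, pos_right, neg_right):
    """Hellinger distance between the class-conditional branch distributions.

    Ranges from 0 (uninformative split) to sqrt(2) (perfect separation).
    Only within-class proportions are used, never the class ratio itself.
    """
    pos = pos_left + pos_right
    neg = neg_left + neg_right
    if pos == 0 or neg == 0:
        return 0.0
    left = math.sqrt(pos_left / pos) - math.sqrt(neg_left / neg)
    right = math.sqrt(pos_right / pos) - math.sqrt(neg_right / neg)
    return math.sqrt(left ** 2 + right ** 2)


# Toy node: 10 labeled positives and 100 unlabeled instances overall,
# assumed class prior 0.1; half of the labeled positives land in this node.
pos_est, neg_est = estimate_node_counts(5, 50, 10, 100, 0.1)
```

Because the criterion depends only on how each class is distributed across the branches, scaling the number of negatives up or down leaves its value unchanged, which is the sense in which it is insensitive to class imbalance.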
Pages: 4547-4578
Number of pages: 32