Merits of Bayesian networks in overcoming small data challenges: a meta-model for handling missing data

Cited by: 6
Authors
Ameur, Hanen [1 ,2 ]
Njah, Hasna [1 ,3 ]
Jamoussi, Salma [1 ,2 ]
Affiliations
[1] Multimedia InfoRmat Syst & Adv Comp Lab, Sfax, Tunisia
[2] Univ Sfax, Higher Inst Comp Sci & Multimedia, Sfax, Tunisia
[3] Univ Gabes, Higher Inst Comp Sci & Multimedia, Gabes, Tunisia
Keywords
Bayesian network; Missing data; Ensemble learning; Structure fusion; Small data; Imbalanced data; Data generation; INDUCTION; SMOTE;
DOI
10.1007/s13042-022-01577-9
CLC classification number
TP18 [Theory of artificial intelligence];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The abundant availability of data in the Big Data era has helped achieve significant advances in machine learning. However, many datasets suffer from incompleteness in different forms, such as missing values, labels, annotations, and records. When the ambiguous records are discarded, the exploitable data shrinks to a small, and sometimes ineffective, portion. Making the most of this small portion is difficult because it usually yields overfitted models. In this paper, we propose a new taxonomy for data missingness in the machine learning context, along with a new meta-model to address the missing data problem in real and open data. Our proposed methodology relies on an H2S kernel whose ultimate goal is the effective learning of a generalized Bayesian network from small input datasets. Our contributions are motivated by the strong probabilistic foundations of Bayesian networks, on the one hand, and by the effectiveness of ensemble learning, on the other. The highlights of our kernel are a new strategy for learning multiple Bayesian network structures and a novel technique for the weighted fusion of Bayesian network structures. To harness the knowledge embedded in the merged network, we propose four H2S-derived systems that address the impacts of missing values and records through annotation, class balancing, missing-value imputation, and data over-sampling. We combine these systems into a meta-model and perform a step-by-step experimental study. The obtained results showcase the efficiency of our contributions in dealing with multi-class problems and with extremely small datasets.
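Since the abstract only outlines the H2S kernel, the minimal sketch below illustrates the underlying ensemble idea, learning several Bayesian network structures and merging them into one, using pgmpy's hill-climbing structure search on bootstrap resamples followed by a simple equal-weight edge vote. The function name, vote threshold, and equal weighting are assumptions made for this example; the paper's weighted structure fusion is more elaborate.

    # Illustrative sketch only (not the paper's H2S algorithm): learn one
    # Bayesian network structure per bootstrap resample with pgmpy's
    # hill-climbing search, then fuse the structures by an equal-weight
    # edge vote. Threshold, seeding, and function name are assumptions.
    from collections import Counter

    import pandas as pd
    from pgmpy.estimators import BicScore, HillClimbSearch


    def fuse_bootstrap_structures(df: pd.DataFrame, n_learners: int = 10,
                                  vote_threshold: float = 0.5, seed: int = 0):
        """Return the directed edges appearing in at least `vote_threshold`
        of the structures learned on bootstrap resamples of `df`."""
        votes = Counter()
        for i in range(n_learners):
            boot = df.sample(frac=1.0, replace=True, random_state=seed + i)
            dag = HillClimbSearch(boot).estimate(scoring_method=BicScore(boot))
            votes.update(dag.edges())
        # Note: a real fusion step must also reconcile conflicting edge
        # directions and break cycles introduced by merging different DAGs.
        return [edge for edge, count in votes.items()
                if count / n_learners >= vote_threshold]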
Pages: 229-251
Number of pages: 23