Merits of Bayesian networks in overcoming small data challenges: a meta-model for handling missing data

Cited by: 9
Authors
Ameur, Hanen [1, 2]
Njah, Hasna [1, 3]
Jamoussi, Salma [1, 2]
Affiliations
[1] Multimedia Information Systems and Advanced Computing Laboratory, Sfax, Tunisia
[2] University of Sfax, Higher Institute of Computer Science and Multimedia, Sfax, Tunisia
[3] University of Gabes, Higher Institute of Computer Science and Multimedia, Gabes, Tunisia
Keywords
Bayesian network; Missing data; Ensemble learning; Structure fusion; Small data; Imbalanced data; Data generation; INDUCTION; SMOTE
DOI
10.1007/s13042-022-01577-9
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
The abundant availability of data in the Big Data era has helped achieve significant advances in the machine learning field. However, many datasets suffer from incompleteness from different perspectives, such as values, labels, annotations, and records. When the records that introduce ambiguity are discarded, the exploitable data settles down to a small, sometimes ineffective, portion. Making the most of this small portion is burdensome because it usually yields overfitted models. In this paper, we propose a new taxonomy of data missingness in the machine learning context, along with a new meta-model to address the missing data problem in real and open data. Our methodology relies on an H2S kernel whose ultimate goal is the effective learning of a generalized Bayesian network from small input datasets. Our contributions are motivated by the strong probabilistic foundation of the Bayesian network, on the one hand, and by the effectiveness of ensemble learning, on the other. The highlights of our kernel are a new strategy for learning multiple Bayesian network structures and a novel technique for the weighted fusion of Bayesian network structures. To harness the knowledge captured by the merged network, we propose four H2S-derived systems that address the impact of missing values and records, covering annotation, balancing, missing-value imputation, and data over-sampling. We combine these systems into a meta-model and perform a step-by-step experimental study. The obtained results showcase the efficiency of our contributions in dealing with multi-class problems and with extremely small datasets.
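To make the structure-fusion idea concrete, the following Python sketch illustrates one simple weighted edge-voting scheme for merging several candidate Bayesian network structures. It is only an illustration under assumed conventions, not the exact H2S fusion algorithm described in the paper; the function name fuse_structures, the toy structures s1-s3, and the threshold parameter are all hypothetical.

from collections import defaultdict

def fuse_structures(structures, weights, threshold=0.5):
    """Weighted edge-voting fusion of candidate Bayesian network structures.

    structures: list of edge sets, each a set of (parent, child) tuples
    weights:    one non-negative weight per structure (e.g. a validation score)
    threshold:  fraction of the total weight an edge must collect to be kept
    """
    total = sum(weights)
    votes = defaultdict(float)
    for edges, w in zip(structures, weights):
        for edge in edges:
            votes[edge] += w
    # Retain edges supported by enough weighted votes.  A complete
    # implementation would additionally repair directed cycles so that
    # the merged graph remains a valid DAG.
    return {edge for edge, v in votes.items() if v / total >= threshold}

# Toy usage: three structures learned from bootstrap samples of a small dataset
s1 = {("A", "B"), ("B", "C")}
s2 = {("A", "B"), ("C", "B")}
s3 = {("A", "B"), ("B", "C"), ("A", "C")}
print(fuse_structures([s1, s2, s3], weights=[0.9, 0.6, 0.8]))
# -> {('A', 'B'), ('B', 'C')}

Each ensemble member contributes its edges weighted by its score, and only edges that gather enough cumulative weight survive into the merged structure, which mirrors the intuition behind combining structures learned from small resampled datasets.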
Pages: 229-251
Number of pages: 23