Mixed Data Imputation Using Generative Adversarial Networks

Cited by: 8
Authors
Khan, Wasif [1]
Zaki, Nazar [1]
Ahmad, Amir [2]
Masud, Mohammad Mehedy [2]
Ali, Luqman [1]
Ali, Nasloon [3]
Ahmed, Luai A. [3, 4]
Affiliations
[1] United Arab Emirates Univ, Coll Informat Technol, Dept Comp Sci & Software Engn, Al Ain, U Arab Emirates
[2] United Arab Emirates Univ, Coll Informat Technol, Dept Informat Syst & Secur, Al Ain, U Arab Emirates
[3] United Arab Emirates Univ, Inst Publ Hlth, Coll Med & Hlth Sci, Al Ain, U Arab Emirates
[4] United Arab Emirates Univ, Zayed Ctr Hlth Sci, Al Ain, U Arab Emirates
Keywords
Training data; Generative adversarial networks; Generators; Statistics; Data models; Machine learning algorithms; Prediction algorithms; Sequential analysis; Multivariate regression; Mixed data imputation; missing data; GANs; miss forest; MICE; denoising auto encoders; MISSING-DATA IMPUTATION; PREDICTION; MODEL
DOI
10.1109/ACCESS.2022.3218067
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Missing values are common in real-world datasets and pose a significant challenge to the performance of statistical and machine learning models. Missing values are generally imputed using statistical methods, such as the mean, median, or mode, or using machine learning approaches. These approaches are limited to either numerical or categorical data. Imputation in mixed datasets that contain both numerical and categorical attributes is challenging and has received little attention. Machine learning-based imputation algorithms usually require a large amount of training data, and obtaining such data is difficult. Furthermore, no considerable work in the literature has focused on the effects of the training and testing size with increasing amounts of missing data. To address this gap, we hypothesized that increasing the amount of training data will improve imputation performance. We first used generative adversarial network (GAN) methods to increase the amount of training data. We considered two state-of-the-art GANs (tabular and conditional tabular) to add synthetic samples using observed data with different synthetic sample ratios. We then used three state-of-the-art imputation models that can handle mixed data: MissForest, multivariate imputation by chained equations (MICE), and denoising autoencoder (DAE). We proposed robust experimental setups on four publicly available datasets with different training-testing data divisions that have increasing missingness ratios. Extensive experimental results show that incorporating synthetic samples with training data achieves better performance compared to the baseline methods for mixed data imputation in both categorical and numerical variables, especially for large missingness ratios.
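The evaluation idea in the abstract (hide a growing share of cells in mixed data, impute them, and score numerical and categorical columns separately) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: it covers only the simple mean/mode statistical imputation that the abstract mentions, and it omits the GAN-based augmentation and the MissForest, MICE, and DAE models. The toy data, the function names, the completely-at-random masking, and the choice of RMSE and accuracy as scores are all assumptions made for illustration.

```python
import random
import statistics


def mask_mcar(rows, ratio, seed=0):
    """Copy rows, setting each cell to None with probability `ratio` (assumed MCAR)."""
    rng = random.Random(seed)
    return [[None if rng.random() < ratio else v for v in row] for row in rows]


def fit_mean_mode(rows, categorical):
    """Per-column fill values: mode for categorical columns, mean for numerical ones."""
    fills = []
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        if j in categorical:
            fills.append(statistics.mode(observed))
        else:
            fills.append(statistics.fmean(observed))
    return fills


def impute(rows, fills):
    """Replace every None cell with the fill value of its column."""
    return [[fills[j] if v is None else v for j, v in enumerate(row)] for row in rows]


def score(truth, masked, imputed, categorical):
    """RMSE over masked numerical cells and accuracy over masked categorical cells."""
    sq_err, n_num, hits, n_cat = 0.0, 0, 0, 0
    for t_row, m_row, i_row in zip(truth, masked, imputed):
        for j, (t, m, i) in enumerate(zip(t_row, m_row, i_row)):
            if m is not None:
                continue  # only cells that were hidden are scored
            if j in categorical:
                hits += int(t == i)
                n_cat += 1
            else:
                sq_err += (t - i) ** 2
                n_num += 1
    rmse = (sq_err / n_num) ** 0.5 if n_num else float("nan")
    acc = hits / n_cat if n_cat else float("nan")
    return rmse, acc


# Hypothetical toy data: column 0 is numerical, column 1 is categorical.
CATEGORICAL = {1}
train = [[23.0, "a"], [35.0, "b"], [41.0, "a"], [29.0, "a"]]
test = [[30.0, "a"], [50.0, "b"], [27.0, "a"]]

# Fill values are learned on the training split and applied to the test split.
fills = fit_mean_mode(train, CATEGORICAL)
for ratio in (0.1, 0.3, 0.5):
    masked = mask_mcar(test, ratio, seed=1)
    imputed = impute(masked, fills)
    print(ratio, score(test, masked, imputed, CATEGORICAL))
```

In a setup like the one the abstract describes, synthetic rows would be appended to `train` before fitting, and the mean/mode step would be replaced by a model that can exploit the extra rows. A mean/mode baseline itself gains little from more data.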
Pages: 124475-124490
Page count: 16