Differentiable and Scalable Generative Adversarial Models for Data Imputation

Cited by: 4
Authors
Wu, Yangyang [1 ]
Wang, Jun [2 ]
Miao, Xiaoye [1 ]
Wang, Wenjia [2 ]
Yin, Jianwei [3 ]
Affiliations
[1] Zhejiang Univ, Ctr Data Sci, Hangzhou 310058, Peoples R China
[2] Hong Kong Univ Sci & Technol, Kowloon, Hong Kong, Peoples R China
[3] Zhejiang Univ, Coll Comp Sci, Ctr Data Sci, Hangzhou 310058, Peoples R China
Keywords
Data imputation; generative adversarial network; large-scale incomplete data; efficient
DOI
10.1109/TKDE.2023.3293129
CLC Number
TP18 [Theory of Artificial Intelligence]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Data imputation has been extensively explored to solve the missing data problem. However, the dramatically increasing volume of incomplete data makes imputation models computationally infeasible in many real-life applications. In this paper, we propose an effective and scalable imputation system named SCIS that significantly speeds up the training of differentiable generative adversarial imputation models with accuracy guarantees on large-scale incomplete data. SCIS consists of two modules: differentiable imputation modeling (DIM) and sample size estimation (SSE). DIM leverages a new masking Sinkhorn divergence function to make an arbitrary generative adversarial imputation model differentiable, while for such a differentiable imputation model, SSE estimates an appropriate sample size to ensure the user-specified imputation accuracy of the final model. Moreover, SCIS can also accelerate autoencoder-based imputation models. Extensive experiments on several real-life large-scale datasets demonstrate that the proposed system accelerates generative adversarial model training by 6.23x. Using only around 1.27% of the samples, SCIS yields accuracy competitive with state-of-the-art imputation methods in much shorter computation time.
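The masking Sinkhorn divergence at the core of DIM builds on the standard Sinkhorn divergence from entropic optimal transport. The sketch below is illustrative only, not the paper's implementation: it computes a debiased Sinkhorn divergence between two small batches via plain Sinkhorn iterations, plus a hypothetical masked variant that fills missing entries (mask == 0) with candidate imputed values before comparison. All function names, parameters, and the masking scheme here are assumptions.

```python
import numpy as np

def sinkhorn_cost(X, Y, eps=1.0, n_iters=300):
    """Entropic-regularized OT cost between the empirical
    distributions of the rows of X and Y (squared-Euclidean cost)."""
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    n, m = C.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-C / eps)                  # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):              # Sinkhorn fixed-point updates
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]       # transport plan
    return float((P * C).sum())

def sinkhorn_divergence(X, Y, eps=1.0):
    """Debiased Sinkhorn divergence: nonnegative, and zero (up to
    numerics) when the two empirical distributions coincide."""
    return (sinkhorn_cost(X, Y, eps)
            - 0.5 * sinkhorn_cost(X, X, eps)
            - 0.5 * sinkhorn_cost(Y, Y, eps))

def masked_divergence(X_obs, X_imp, mask, eps=1.0):
    """Hypothetical masked variant: complete the observed batch with
    imputed values where mask == 0, then compare the candidate
    imputation against the completed batch. Illustrative only."""
    completed = np.where(mask.astype(bool), X_obs, X_imp)
    return sinkhorn_divergence(completed, X_imp, eps)
```

In a GAN-style imputation loop, a generator's imputed batch would be scored against observed data with such a divergence instead of a discriminator's output, which is what makes the training objective differentiable end-to-end.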
Pages: 490-503 (14 pages)