Genomic data imputation with variational auto-encoders

Cited by: 44
Authors
Qiu, Yeping Lina [1, 2]
Zheng, Hong [1]
Gevaert, Olivier [1, 3]
Affiliations
[1] Stanford Univ, Stanford Ctr Biomed Informat Res, Dept Med, Stanford, CA 94305 USA
[2] Stanford Univ, Dept Elect Engn, Stanford, CA 94305 USA
[3] Stanford Univ, Dept Biomed Data Sci, Stanford, CA 94305 USA
Source
GIGASCIENCE | 2020 / Vol. 9 / Issue 08
Funding
US National Institutes of Health;
Keywords
imputation; variational auto-encoder; deep learning; MISSING VALUE IMPUTATION; AUTOENCODERS; NETWORK;
DOI
10.1093/gigascience/giaa082
Chinese Library Classification number
Q [Biological Sciences];
Discipline classification code
07 ; 0710 ; 09 ;
Abstract
Background: As missing values are frequently present in genomic data, practical methods to handle missing data are necessary for downstream analyses that require complete data sets. State-of-the-art imputation techniques, including methods based on singular value decomposition and K-nearest neighbors, can be computationally expensive for large data sets and it is difficult to modify these algorithms to handle certain cases not missing at random. Results: In this work, we use a deep-learning framework based on the variational auto-encoder (VAE) for genomic missing value imputation and demonstrate its effectiveness in transcriptome and methylome data analysis. We show that in the vast majority of our testing scenarios, VAE achieves similar or better performances than the most widely used imputation standards, while having a computational advantage at evaluation time. When dealing with data missing not at random (e.g., few values are missing), we develop simple yet effective methodologies to leverage the prior knowledge about missing data. Furthermore, we investigate the effect of varying latent space regularization strength in VAE on the imputation performances and, in this context, show why VAE has a better imputation capacity compared to a regular deterministic auto-encoder. Conclusions: We describe a deep learning imputation framework for transcriptome and methylome data using a VAE and show that it can be a preferable alternative to traditional methods for data imputation, especially in the setting of large-scale data and certain missing-not-at-random scenarios.
Pages: 12