Heterogeneous Large Datasets Integration Using Bayesian Factor Regression

Cited by: 10
Authors
Avalos-Pacheco, Alejandra [1, 2]
Rossell, David [3]
Savage, Richard S. [2]
Affiliations
[1] Harvard Med Sch, Harvard MIT Ctr Regulatory Sci, 210 Longwood Av, Boston, MA 02115 USA
[2] Univ Warwick, Dept Stat, Coventry CV4 7AL, W Midlands, England
[3] Univ Pompeu Fabra, Dept Business & Econ, Carrer Ramon Trias Fargas 25-27, Barcelona 08005, Spain
Source
BAYESIAN ANALYSIS | 2022 / Vol. 17 / Issue 01
Keywords
Bayesian factor analysis; EM; non-local priors; shrinkage; GENE-EXPRESSION; MICROARRAY DATA; VARIABLE SELECTION; NORMALIZATION; DECOMPOSITION; LIKELIHOOD; MODELS; CANCER
DOI
10.1214/20-BA1240
Chinese Library Classification (CLC) number
O1 [Mathematics]
Discipline classification code
0701; 070101
Abstract
Two key challenges in modern statistical applications are the large amount of information recorded per individual, and that such data are often not collected all at once but in batches. These batch effects can be complex, causing distortions in both mean and variance. We propose a novel sparse latent factor regression model to integrate such heterogeneous data. The model provides a tool for data exploration via dimensionality reduction and sparse low-rank covariance estimation while correcting for a range of batch effects. We study the use of several sparse priors (local and non-local) to learn the dimension of the latent factors. We provide a flexible methodology for sparse factor regression which is not limited to data with batch effects. Our model is fitted in a deterministic fashion by means of an EM algorithm for which we derive closed-form updates, contributing a novel scalable algorithm for non-local priors of interest beyond the immediate scope of this paper. We present several examples, with a focus on bioinformatics applications. Our results show an increase in the accuracy of the dimensionality reduction, with non-local priors substantially improving the reconstruction of factor cardinality. The results of our analyses illustrate how failing to properly account for batch effects can result in unreliable inference. Our model provides a novel approach to latent factor regression that balances sparsity with sensitivity in scenarios both with and without batch effects and is highly computationally efficient.
Pages: 33 - 66
Page count: 34