Heterogeneous Large Datasets Integration Using Bayesian Factor Regression

Cited by: 10
Authors
Avalos-Pacheco, Alejandra [1, 2]
Rossell, David [3 ]
Savage, Richard S. [2 ]
Affiliations
[1] Harvard Med Sch, Harvard MIT Ctr Regulatory Sci, 210 Longwood Av, Boston, MA 02115 USA
[2] Univ Warwick, Dept Stat, Coventry CV4 7AL, W Midlands, England
[3] Univ Pompeu Fabra, Dept Business & Econ, Carrer Ramon Trias Fargas 25-27, Barcelona 08005, Spain
Source
BAYESIAN ANALYSIS | 2022 / Vol. 17 / Issue 01
Keywords
Bayesian factor analysis; EM; non-local priors; shrinkage; GENE-EXPRESSION; MICROARRAY DATA; VARIABLE SELECTION; NORMALIZATION; DECOMPOSITION; LIKELIHOOD; MODELS; CANCER;
DOI
10.1214/20-BA1240
Chinese Library Classification number
O1 [Mathematics];
Discipline classification code
0701 ; 070101 ;
Abstract
Two key challenges in modern statistical applications are the large amount of information recorded per individual, and that such data are often not collected all at once but in batches. These batch effects can be complex, causing distortions in both mean and variance. We propose a novel sparse latent factor regression model to integrate such heterogeneous data. The model provides a tool for data exploration via dimensionality reduction and sparse low-rank covariance estimation while correcting for a range of batch effects. We study the use of several sparse priors (local and non-local) to learn the dimension of the latent factors. We provide a flexible methodology for sparse factor regression which is not limited to data with batch effects. Our model is fitted in a deterministic fashion by means of an EM algorithm for which we derive closed-form updates, contributing a novel scalable algorithm for non-local priors of interest beyond the immediate scope of this paper. We present several examples, with a focus on bioinformatics applications. Our results show an increase in the accuracy of the dimensionality reduction, with non-local priors substantially improving the reconstruction of factor cardinality. The results of our analyses illustrate how failing to properly account for batch effects can result in unreliable inference. Our model provides a novel approach to latent factor regression that balances sparsity with sensitivity in scenarios both with and without batch effects and is highly computationally efficient.
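To make the EM idea in the abstract concrete, here is a minimal sketch of closed-form EM updates for a plain Gaussian factor model, x = M z + e with z ~ N(0, I) and diagonal noise variances. It is not the authors' algorithm: it has no batch-effect terms, no regression covariates and no sparse (local or non-local) priors on the loadings. The function name `fit_factor_analysis_em` and all variable names are assumptions made for this illustration.

```python
import numpy as np


def fit_factor_analysis_em(X, q, n_iter=200, seed=0):
    """EM for a plain Gaussian factor model (illustrative sketch only).

    X : (n, p) data matrix; q : number of latent factors.
    Returns loadings M (p, q) and noise variances psi (p,).
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Xc = X - X.mean(axis=0)            # centre each variable
    sample_var = Xc.var(axis=0)        # diagonal of the sample covariance
    M = rng.normal(scale=0.1, size=(p, q))
    psi = sample_var.copy()

    for _ in range(n_iter):
        # E-step: posterior moments of the latent factors given the data
        Mt_psi_inv = M.T / psi                          # M^T Psi^{-1}, (q, p)
        G = np.linalg.inv(np.eye(q) + Mt_psi_inv @ M)   # posterior covariance
        Ez = Xc @ Mt_psi_inv.T @ G                      # posterior means, (n, q)
        Ezz = n * G + Ez.T @ Ez                         # sum of E[z z^T]

        # M-step: closed-form updates for loadings and noise variances
        Xt_Ez = Xc.T @ Ez                               # (p, q)
        M = np.linalg.solve(Ezz, Xt_Ez.T).T             # Xt_Ez @ inv(Ezz)
        psi = sample_var - np.sum(M * Xt_Ez, axis=1) / n
        psi = np.maximum(psi, 1e-8)                     # keep variances positive

    return M, psi
```

The fitted low-rank covariance estimate is `M @ M.T + np.diag(psi)`. A sparse version in the spirit of the paper would change the loading update in the M-step to include a prior penalty; that step is deliberately left out here.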
Pages: 33-66
Page count: 34