High-Dimensional Overdispersed Generalized Factor Model With Application to Single-Cell Sequencing Data Analysis

Cited by: 0
Authors
Nie, Jinyu [1, 2]
Qin, Zhilong [3]
Liu, Wei [4]
Affiliations
[1] Southwestern Univ Finance & Econ, Ctr Stat Res, Chengdu, Peoples R China
[2] Southwestern Univ Finance & Econ, Sch Stat, Chengdu, Peoples R China
[3] Southwestern Univ Finance & Econ, Inst Western China Econ Res, Chengdu, Peoples R China
[4] Sichuan Univ, Sch Math, Chengdu, Peoples R China
Keywords
generalized factor model; high dimension; mixed-type data; overdispersion; variational EM; MAXIMUM-LIKELIHOOD; INFERENCE; NUMBER;
DOI
10.1002/sim.10213
Chinese Library Classification (CLC) number
Q [Biological Sciences];
Discipline classification codes
07 ; 0710 ; 09 ;
Abstract
The current high-dimensional linear factor models fail to account for the different types of variables, while high-dimensional nonlinear factor models often overlook the overdispersion present in mixed-type data. However, overdispersion is prevalent in practical applications, particularly in fields like biomedical and genomics studies. To address this practical demand, we propose an overdispersed generalized factor model (OverGFM) for performing high-dimensional nonlinear factor analysis on overdispersed mixed-type data. Our approach incorporates an additional error term to capture the overdispersion that cannot be accounted for by factors alone. However, this introduces significant computational challenges due to the involvement of two high-dimensional latent random matrices in the nonlinear model. To overcome these challenges, we propose a novel variational EM algorithm that integrates Laplace and Taylor approximations. This algorithm provides iterative explicit solutions for the complex variational parameters and is proven to possess excellent convergence properties. We also develop a criterion based on the singular value ratio to determine the optimal number of factors. Numerical results demonstrate the effectiveness of this criterion. Through comprehensive simulation studies, we show that OverGFM outperforms state-of-the-art methods in terms of estimation accuracy and computational efficiency. Furthermore, we demonstrate the practical merit of our method through its application to two datasets from genomics. To facilitate its usage, we have integrated the implementation of OverGFM into the R package GFM.
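The singular value ratio idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the OverGFM implementation from the R package GFM: the function name singular_value_ratio, the simulated dimensions, and the noise scale are all assumptions made for illustration, and the criterion is applied here directly to a simulated noisy low-rank matrix rather than to quantities estimated by the variational EM algorithm. The estimate is the index k at which the ratio of the k-th to the (k+1)-th singular value is largest.

```python
import numpy as np


def singular_value_ratio(matrix, q_max):
    """Return (q_hat, ratios), where q_hat in 1..q_max maximises
    sigma_k / sigma_{k+1} over the singular values of `matrix`.

    Hypothetical helper for illustration only; not part of the GFM package.
    """
    s = np.linalg.svd(matrix, compute_uv=False)  # singular values, descending
    if q_max >= len(s):
        raise ValueError("q_max must be smaller than the number of singular values")
    ratios = s[:q_max] / s[1:q_max + 1]
    return int(np.argmax(ratios)) + 1, ratios


# Simulated example (all settings below are assumptions).
rng = np.random.default_rng(0)
n, p, q_true = 200, 50, 3
H = rng.normal(size=(n, q_true))      # latent factors
B = rng.normal(size=(p, q_true))      # loadings
eps = 0.1 * rng.normal(size=(n, p))   # extra error term, standing in for overdispersion
Y = H @ B.T + eps                     # linear predictor: low-rank part plus error

q_hat, ratios = singular_value_ratio(Y, q_max=10)
print("estimated number of factors:", q_hat)
```

In this toy setting the low-rank signal is far stronger than the added error, so the largest gap between consecutive singular values falls at the true number of factors. With real mixed-type data the criterion would be applied after model fitting, and the upper bound q_max is a tuning choice.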
Pages: 4836-4849
Page count: 14