Fast and accurate out-of-core PCA framework for large scale biobank data

Cited by: 4
Authors
Li, Zilong [1]
Meisner, Jonas [2, 3]
Albrechtsen, Anders [1 ]
Affiliations
[1] Univ Copenhagen, Dept Biol, Sect Computat & RNA Biol, DK-2200 Copenhagen, Denmark
[2] Copenhagen Univ Hosp, Mental Hlth Ctr Copenhagen, Biol & Precis Psychiat, DK-2100 Copenhagen, Denmark
[3] Univ Copenhagen, Novo Nord Fdn Ctr Prot Res, DK-2200 Copenhagen, Denmark
Keywords
ALGORITHM; GENOME
DOI
10.1101/gr.277525.122
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification codes
071010; 081704
Abstract
Principal component analysis (PCA) is widely used in statistics, machine learning, and genomics for dimensionality reduction and uncovering low-dimensional latent structure. To address the challenges posed by ever-growing data size, fast and memory-efficient PCA methods have gained prominence. In this paper, we propose a novel randomized singular value decomposition (RSVD) algorithm implemented in PCAone, featuring a window-based optimization scheme that accelerates convergence while improving accuracy. Additionally, PCAone incorporates out-of-core and multithreaded implementations of the existing Implicitly Restarted Arnoldi Method (IRAM) and RSVD. Through comprehensive evaluations using multiple large-scale real-world data sets in different fields, we show the advantage of PCAone over existing methods. The new algorithm achieves significantly faster computation time while maintaining accuracy comparable to the slower IRAM method. Notably, our analyses of UK Biobank, comprising around 0.5 million individuals and 6.1 million common single nucleotide polymorphisms, show that PCAone accurately computes the top 40 principal components within 9 h. This analysis effectively captures population structure, signals of selection, structural variants, and low recombination regions, utilizing <20 GB of memory and 20 CPU threads. Furthermore, when applied to single-cell RNA sequencing data featuring 1.3 million cells, PCAone accurately captures the top 40 principal components in 49 min. This performance represents a 10-fold improvement over state-of-the-art tools.
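For readers unfamiliar with the RSVD family that the abstract refers to, the following is a minimal in-memory sketch of the standard randomized SVD with power iterations. It is not PCAone's window-based or out-of-core algorithm, which the abstract does not specify in detail; the function names `randomized_svd` and `pca_scores` and the parameters `oversample` and `n_power_iter` are illustrative assumptions.

```python
import numpy as np


def randomized_svd(A, k, oversample=10, n_power_iter=4, seed=0):
    """Approximate the top-k SVD of A with a generic randomized scheme.

    Illustrative only: this is the textbook random-projection approach,
    not the window-based variant described in the abstract.
    """
    rng = np.random.default_rng(seed)
    n_rows, n_cols = A.shape
    # Sketch size: target rank plus a few extra columns, capped by the matrix size.
    ell = min(k + oversample, n_rows, n_cols)

    # Project A onto a random subspace and orthonormalize the result.
    omega = rng.standard_normal((n_cols, ell))
    Q, _ = np.linalg.qr(A @ omega)

    # Power iterations sharpen the subspace; QR at each step keeps it stable.
    for _ in range(n_power_iter):
        Z, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Z)

    # Solve the small problem exactly, then map back to the original space.
    B = Q.T @ A
    U_small, S, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small
    return U[:, :k], S[:k], Vt[:k]


def pca_scores(X, k):
    """Top-k principal component scores of X (rows are samples)."""
    X_centered = X - X.mean(axis=0)
    U, S, _ = randomized_svd(X_centered, k)
    return U * S
```

Each pass of the power-iteration loop reads the whole matrix twice, which is the cost that block-wise or out-of-core schemes aim to reduce when the data do not fit in memory.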
Pages: 1599-1608
Page count: 10