TeraPCA: a fast and scalable software package to study genetic variation in tera-scale genotypes

Cited by: 18
Authors
Bose, Aritra [1]
Kalantzis, Vassilis [2]
Kontopoulou, Eugenia-Maria [1]
Elkady, Mai [1]
Paschou, Peristera [3]
Drineas, Petros [1]
Affiliations
[1] Purdue Univ, Dept Comp Sci, W Lafayette, IN 47907 USA
[2] Thomas J Watson Res Ctr, IBM Res, Yorktown Hts, NY 10598 USA
[3] Purdue Univ, Dept Biol Sci, W Lafayette, IN 47907 USA
Funding
US National Science Foundation;
Keywords
PRINCIPAL; STRATIFICATION;
DOI
10.1093/bioinformatics/btz157
Chinese Library Classification
Q5 [Biochemistry];
Subject classification codes
071010 ; 081704 ;
Abstract
Motivation: Principal Component Analysis is a key tool in the study of population structure in human genetics. As modern datasets grow ever larger, traditional approaches that load the entire dataset into system memory (Random Access Memory) become impractical, and out-of-core implementations are the only viable alternative. Results: We present TeraPCA, a C++ implementation of the Randomized Subspace Iteration method to perform Principal Component Analysis of large-scale datasets. TeraPCA can be applied both in-core and out-of-core and operates successfully even on commodity hardware with only a few gigabytes of system memory. Moreover, TeraPCA has minimal dependencies on external libraries: it requires only a working installation of the BLAS and LAPACK libraries. When applied to a dataset containing a million individuals genotyped on a million markers, TeraPCA requires <5 h (in multi-threaded mode) to accurately compute the 10 leading principal components. An extensive experimental analysis shows that TeraPCA is both fast and accurate and is competitive with current state-of-the-art software for the same task.
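The out-of-core idea in the abstract can be illustrated with a minimal pure-Python sketch of subspace iteration from a random starting block. This is not TeraPCA's code (which is C++ built on BLAS and LAPACK). All function names here (`apply_gram`, `subspace_iteration`, `row_blocks`) are hypothetical, and the data is assumed to be already centered and normalized. The sketch also omits refinements a production solver would likely use, such as oversampling and a final Rayleigh-Ritz step. The key point is that the product (X^T X) Q is accumulated one row block at a time, so the full matrix X never needs to be in memory.

```python
import random

def matmul(A, B):
    """Plain-Python matrix product of two matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def orthonormalize(Q):
    """Modified Gram-Schmidt on the columns of Q."""
    basis = []
    for v in transpose(Q):
        for u in basis:
            dot = sum(a * b for a, b in zip(u, v))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = sum(a * a for a in v) ** 0.5
        basis.append([a / norm for a in v])
    return transpose(basis)

def apply_gram(row_blocks, Q):
    """Accumulate (X^T X) Q block by block; only one block of X is live at a time."""
    d, k = len(Q), len(Q[0])
    out = [[0.0] * k for _ in range(d)]
    for block in row_blocks():
        partial = matmul(transpose(block), matmul(block, Q))
        for i in range(d):
            for j in range(k):
                out[i][j] += partial[i][j]
    return out

def subspace_iteration(row_blocks, d, k, iters=50, seed=0):
    """Approximate the k leading eigenpairs of X^T X from a random Gaussian start."""
    rng = random.Random(seed)
    Q = orthonormalize([[rng.gauss(0, 1) for _ in range(k)] for _ in range(d)])
    for _ in range(iters):
        Q = orthonormalize(apply_gram(row_blocks, Q))
    # Rayleigh quotients of the converged columns estimate the eigenvalues.
    AQ = apply_gram(row_blocks, Q)
    eigvals = [sum(Q[i][j] * AQ[i][j] for i in range(d)) for j in range(k)]
    return Q, eigvals

# Toy stand-in for a genotype matrix: 4 individuals (rows) by 3 markers (columns).
X = [[3.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]

def row_blocks(size=2):
    """Yield X two rows at a time, standing in for reads from disk."""
    for start in range(0, len(X), size):
        yield X[start:start + size]

Q, eigvals = subspace_iteration(row_blocks, d=3, k=2)
```

In this toy case X^T X is diagonal with entries 9, 4 and 1, so `eigvals` should come out close to 9 and 4, and the first column of `Q` should align with the first marker axis up to sign. In a real setting, `row_blocks` would read successive chunks of a genotype file from disk, and the dense products would be delegated to an optimized BLAS.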
Pages: 3679-3683
Number of pages: 5