Consequences of PCA graphs, SNP codings, and PCA variants for elucidating population structure

Cited by: 23
Authors
Gauch, Hugh G., Jr. [1]
Qian, Sheng [2]
Piepho, Hans-Peter [2, 3]
Zhou, Linda [1]
Chen, Rui [2]
Affiliations
[1] Cornell Univ, Coll Agr & Life Sci, Soil & Crop Sci, Ithaca, NY 14853 USA
[2] Cornell Univ, Coll Agr & Life Sci, Biol Stat & Computat Biol, Ithaca, NY 14853 USA
[3] Univ Hohenheim, Inst Crop Sci, Biostat Unit, Stuttgart, Germany
Source
PLOS ONE | 2019 / Vol. 14 / No. 06
Keywords
PRINCIPAL-COMPONENT ANALYSIS; GENETIC DIVERSITY; ASSOCIATION; EUROPE
DOI
10.1371/journal.pone.0218306
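Chinese Library Classification
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification codes
07; 0710; 09
Abstract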
SNP datasets are high-dimensional, often with thousands to millions of SNPs and hundreds to thousands of samples or individuals. Accordingly, PCA graphs are frequently used to provide a low-dimensional visualization that displays and reveals patterns in SNP data from humans, animals, plants, and microbes, especially to elucidate population structure. PCA is not a single method that is always done the same way; rather, it requires three choices, which we explore as a three-way factorial: two kinds of PCA graphs by three SNP codings by six PCA variants. Our three main recommendations are simple and easily implemented: use PCA biplots, SNP coding 1 for the rare allele and 0 for the common allele, and double-centered PCA (or AMMI1 if main effects are also of interest). We also document contemporary practices through a literature survey of 125 representative articles that apply PCA to SNP data, and find that virtually none implement our recommendations. The ultimate benefit from informed and optimal choices of PCA graph, SNP coding, and PCA variant is expected to be the discovery of more biology, and thereby the acceleration of medical, agricultural, and other vital applications.
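The three recommendations in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a binary SNP-by-individual matrix (rows are SNPs, columns are individuals), and the function names `code_rare_allele_as_one` and `double_centered_pca`, the symmetric square-root scaling of the scores, and the toy data are all hypothetical choices made for illustration only.

```python
import numpy as np

def code_rare_allele_as_one(genotypes):
    """Recode a 0/1 SNP-by-individual matrix so that 1 marks the rarer allele of each SNP."""
    g = np.array(genotypes, dtype=float)   # copy, so the caller's data is not modified
    flip = g.mean(axis=1) > 0.5            # rows where the allele coded 1 is the common one
    g[flip] = 1.0 - g[flip]
    return g

def double_centered_pca(x, n_components=2):
    """PCA of a matrix after removing row means and column means (double-centering)."""
    x = np.asarray(x, dtype=float)
    row_means = x.mean(axis=1, keepdims=True)
    col_means = x.mean(axis=0, keepdims=True)
    # Residual after removing both main effects; adding the grand mean back
    # corrects for it having been subtracted twice.
    resid = x - row_means - col_means + x.mean()
    u, s, vt = np.linalg.svd(resid, full_matrices=False)
    k = n_components
    # Assumed scaling: split each singular value evenly between rows and columns,
    # so SNPs and individuals can share one biplot.
    snp_scores = u[:, :k] * np.sqrt(s[:k])
    individual_scores = vt[:k].T * np.sqrt(s[:k])
    explained = s[:k] ** 2 / np.sum(s ** 2)
    return snp_scores, individual_scores, explained

# Toy data (made up): 6 SNPs by 5 individuals.
rng = np.random.default_rng(0)
raw = (rng.random((6, 5)) < 0.7).astype(float)
coded = code_rare_allele_as_one(raw)
snp_scores, individual_scores, explained = double_centered_pca(coded, n_components=2)
```

Plotting the rows of `snp_scores` and `individual_scores` on the same pair of axes gives a biplot, the kind of graph the abstract recommends, in which SNPs and individuals appear together.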
Pages: 26