Sparse Principal Component Analysis for Identifying Ancestry-Informative Markers in Genome-Wide Association Studies

Cited by: 31
Authors
Lee, Seokho [2]
Epstein, Michael P. [3]
Duncan, Richard [3]
Lin, Xihong [1]
Affiliations
[1] Harvard Univ, Sch Publ Hlth, Dept Biostat, Boston, MA 02115 USA
[2] Hankuk Univ Foreign Studies, Dept Stat, Yongin, South Korea
[3] Emory Univ, Sch Med, Dept Human Genet, Atlanta, GA USA
Funding
National Research Foundation, Singapore; National Institutes of Health (USA)
Keywords
ancestry-informative markers; genome-wide association studies; population stratification; principal component analysis; variable selection; POPULATION STRATIFICATION; SEMIPARAMETRIC TEST; ADMIXTURE; PANEL; MAP;
DOI
10.1002/gepi.21621
Chinese Library Classification
Q3 [Genetics]
Discipline classification code
071007; 090102
Abstract
Genome-wide association studies (GWAS) routinely apply principal component analysis (PCA) to infer population structure within a sample to correct for confounding due to ancestry. GWAS implementation of PCA uses tens of thousands of single-nucleotide polymorphisms (SNPs) to infer structure, despite the fact that only a small fraction of such SNPs provides useful information on ancestry. The identification of this reduced set of ancestry-informative markers (AIMs) from a GWAS has practical value; for example, researchers can genotype the AIM set to correct for potential confounding due to ancestry in follow-up studies that utilize custom SNP or sequencing technology. We propose a novel technique to identify AIMs from genome-wide SNP data using sparse PCA. The procedure uses penalized regression methods to identify those SNPs in a genome-wide panel that significantly contribute to the principal components while encouraging SNPs that provide negligible loadings to vanish from the analysis. We found that sparse PCA leads to negligible loss of ancestry information compared to traditional PCA analysis of genome-wide SNP data. We further demonstrate the value of sparse PCA for AIM selection using real data from the International HapMap Project and a genome-wide study of inflammatory bowel disease. We have implemented our approach in open-source R software for public use. Genet. Epidemiol. 36:293-302, 2012. © 2012 Wiley Periodicals, Inc.
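The idea described in the abstract, penalized regression that drives negligible SNP loadings to exactly zero, can be illustrated with a small sketch. This is not the authors' R software. It is a generic sparse PCA formulation (an L1-penalized rank-one approximation solved by alternating soft-thresholding), run on simulated genotypes. The function names `soft_threshold` and `sparse_pc1`, the penalty value `lam = 4.0`, and the simulated allele frequencies are all assumptions made for illustration.

```python
import numpy as np


def soft_threshold(z, lam):
    """Lasso shrinkage: move each entry toward zero by lam, clipping at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)


def sparse_pc1(X, lam, n_iter=200, tol=1e-8):
    """First sparse principal component of X (samples x SNPs).

    Alternates between (a) an L1-penalized regression of the centered data
    on the current score vector u, which has the closed-form solution
    soft_threshold(Xc.T @ u, lam), and (b) re-estimating u from the sparse
    loadings. Returns (u, v): unit-norm scores and sparse loadings.
    """
    Xc = X - X.mean(axis=0)
    # Start from ordinary PCA: leading singular vectors of the centered data.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    u, v = U[:, 0], s[0] * Vt[0]
    for _ in range(n_iter):
        v_new = soft_threshold(Xc.T @ u, lam)
        if not np.any(v_new):
            return u, v_new  # penalty too large: every loading vanished
        score = Xc @ v_new
        norm = np.linalg.norm(score)
        if norm == 0.0:
            return u, v_new
        u_new = score / norm
        converged = np.linalg.norm(v_new - v) < tol * max(1.0, np.linalg.norm(v))
        u, v = u_new, v_new
        if converged:
            break
    return u, v


# Simulated example (assumed values): two populations of 100 samples each and
# 200 SNPs, of which only the first 10 differ in allele frequency.
rng = np.random.default_rng(0)
n, p, n_aims = 100, 200, 10
freq_a = np.full(p, 0.5)
freq_b = np.full(p, 0.5)
freq_a[:n_aims] = 0.1
freq_b[:n_aims] = 0.9
X = np.vstack([
    rng.binomial(2, freq_a, size=(n, p)),
    rng.binomial(2, freq_b, size=(n, p)),
]).astype(float)
labels = np.repeat([0, 1], n)

u, v = sparse_pc1(X, lam=4.0)
selected = np.flatnonzero(v)  # SNPs with nonzero loading: candidate AIMs
print("SNPs with nonzero loading:", selected)
print("correlation of PC1 scores with population label:",
      abs(np.corrcoef(u, labels)[0, 1]))
```

In this sketch the penalty `lam` controls how many SNPs survive: larger values zero out more loadings. In practice it would need to be tuned, for example by checking how well the sparse scores reproduce the ordinary PCA scores.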
Pages: 293-302
Page count: 10
Related papers
50 in total
  • [1] Federated Principal Component Analysis for Genome-Wide Association Studies
    Hartebrodt, Anne
    Nasirigerdeh, Reza
    Blumenthal, David B.
    Rottger, Richard
    2021 21ST IEEE INTERNATIONAL CONFERENCE ON DATA MINING (ICDM 2021), 2021, : 1090 - 1095
  • [2] Ancestry informative markers for distinguishing between Thai populations based on genome-wide association datasets
    Vongpaisarnsin, Komkiat
    Listman, Jennifer Beth
    Malison, Robert T.
    Gelernter, Joel
    LEGAL MEDICINE, 2015, 17 (04) : 245 - 250
  • [3] Principal Component Analysis Characterizes Shared Pathogenetics from Genome-Wide Association Studies
    Chang, Diana
    Keinan, Alon
    PLOS COMPUTATIONAL BIOLOGY, 2014, 10 (09)
  • [4] Supervised categorical principal component analysis for genome-wide association analyses
    Meng Lu
    Hye-Seung Lee
    David Hadley
    Jianhua Z Huang
    Xiaoning Qian
    BMC Genomics, 15
  • [5] Supervised categorical principal component analysis for genome-wide association analyses
    Lu, Meng
    Lee, Hye-Seung
    Hadley, David
    Huang, Jianhua Z.
    Qian, Xiaoning
    BMC GENOMICS, 2014, 15 : 1 - 10
  • [6] Maximizing the Power of Principal-Component Analysis of Correlated Phenotypes in Genome-wide Association Studies
    Aschard, Hugues
    Vilhjalmsson, Bjarni J.
    Greliche, Nicolas
    Morange, Pierre-Emmanuel
    Tregouet, David-Alexandre
    Kraft, Peter
    AMERICAN JOURNAL OF HUMAN GENETICS, 2014, 94 (05) : 662 - 676
  • [7] Developing a novel panel of genome-wide ancestry informative markers for bio-geographical ancestry estimates
    Jia, Jing
    Wei, Yi-Liang
    Qin, Cui-Jiao
    Hu, Lan
    Wan, Li-Hua
    Li, Cai-Xia
    FORENSIC SCIENCE INTERNATIONAL-GENETICS, 2014, 8 (01) : 187 - 194
  • [8] Development of a Panel of Genome-Wide Ancestry Informative Markers to Study Admixture Throughout the Americas
    Galanter, Joshua Mark
    Carlos Fernandez-Lopez, Juan
    Gignoux, Christopher R.
    Barnholtz-Sloan, Jill
    Fernandez-Rozadilla, Ceres
    Via, Marc
    Hidalgo-Miranda, Alfredo
    Contreras, Alejandra V.
    Uribe Figueroa, Laura
    Raska, Paola
    Jimenez-Sanchez, Gerardo
    Silva Zolezzi, Irma
    Torres, Maria
    Ruiz Ponte, Clara
    Ruiz, Yarimar
    Salas, Antonio
    Nguyen, Elizabeth
    Eng, Celeste
    Borjas, Lisbeth
    Zabala, William
    Barreto, Guillermo
    Rondon Gonzalez, Fernando
    Ibarra, Adriana
    Taboada, Patricia
    Porras, Liliana
    Moreno, Fabian
    Bigham, Abigail
    Gutierrez, Gerardo
    Brutsaert, Tom
    Leon-Velarde, Fabiola
    Moore, Lorna G.
    Vargas, Enrique
    Cruz, Miguel
    Escobedo, Jorge
    Rodriguez-Santana, Jose
    Rodriguez-Cintron, William
    Chapela, Rocio
    Ford, Jean G.
    Bustamante, Carlos
    Seminara, Daniela
    Shriver, Mark
    Ziv, Elad
    Burchard, Esteban Gonzalez
    Haile, Robert
    Parra, Esteban
    Carracedo, Angel
    PLOS GENETICS, 2012, 8 (03):
  • [9] Genome-wide association studies combined with k-fold cross-validation identify rs17822931 as an ancestry-informative marker in Han Chinese population
    Li, Zheng
    Wu, Jiayi
    Yang, Jiawen
    Li, Kai
    Chen, Ji
    Huang, Shuainan
    Ji, Qiang
    Kong, Xiaochao
    Xie, Sumei
    Zhan, Wenxuan
    Zhang, Beilei
    Ye, Ke
    Liu, Qingfan
    Mao, Zhengsheng
    Cao, Yue
    Huang, Huijie
    Yu, Youjia
    Wang, Kang
    Yu, Yanfang
    Li, Ding
    Chen, Feng
    Chen, Peng
    ELECTROPHORESIS, 2023, 44 (15-16) : 1187 - 1196
  • [10] Microsatellite markers for genome-wide association studies
    Bahram, Seiamak
    Inoko, Hidetoshi
    NATURE REVIEWS GENETICS, 2007, 8 (02)