Integrative analysis of individual-level data and high-dimensional summary statistics

Cited by: 4
Authors
Fu, Sheng [1]
Deng, Lu [2]
Zhang, Han [3]
Qin, Jing [4]
Yu, Kai [1, 5]
Affiliations
[1] NCI, Div Canc Epidemiol & Genet, Bethesda, MD 20892 USA
[2] Nankai Univ, Sch Stat & Data Sci, Tianjin 300071, Peoples R China
[3] Informat Management Serv Inc, Bethesda, MD 20892 USA
[4] NIAID, NIH, Bethesda, MD 20892 USA
[5] NCI, Div Canc Epidemiol & Genet, 9609 Med Ctr Dr, Bethesda, MD 20892 USA
Funding
National Natural Science Foundation of China;
Keywords
GENOME-WIDE ASSOCIATION; REGRESSION MODELS; INFORMATION; PREDICTION; DISEASE; SUSCEPTIBILITY; VARIANTS; RISK;
DOI
10.1093/bioinformatics/btad156
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification codes
071010 ; 081704 ;
Abstract
Motivation: Researchers usually conduct statistical analyses based on models built on raw data collected from individual participants (individual-level data). There is a growing interest in enhancing inference efficiency by incorporating aggregated summary information from other sources, such as summary statistics on genetic markers' marginal associations with a given trait generated from genome-wide association studies. However, combining high-dimensional summary data with individual-level data using existing integrative procedures can be challenging due to various numeric issues in optimizing an objective function over a large number of unknown parameters.

Results: We develop a procedure to improve the fitting of a targeted statistical model by leveraging external summary data for more efficient statistical inference (both effect estimation and hypothesis testing). To make this procedure scalable to high-dimensional summary data, we propose a divide-and-conquer strategy by breaking the task into easier parallel jobs, each fitting the targeted model by integrating the individual-level data with a small proportion of summary data. We obtain the final estimates of model parameters by pooling results from multiple fitted models through the minimum distance estimation procedure. We improve the procedure for a general class of additive models commonly encountered in genetic studies. We further expand these two approaches to integrate individual-level and high-dimensional summary data from different study populations. We demonstrate the advantage of the proposed methods through simulations and an application to the study of the effect on pancreatic cancer risk by the polygenic risk score defined by BMI-associated genetic markers.
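The pooling step described in the abstract can be illustrated with a generic minimum distance estimator. The following is a minimal sketch and not the authors' implementation: it assumes each of K parallel jobs returns an estimate of the same p-dimensional parameter vector, and that the joint covariance matrix of the stacked estimates is available (the estimates would generally be correlated, since every job reuses the same individual-level data). The function name pool_min_distance and its arguments are hypothetical.

```python
import numpy as np

def pool_min_distance(estimates, joint_cov):
    """Pool K estimates of one p-vector by minimum distance estimation.

    estimates : array of shape (K, p), one row per parallel job (hypothetical input).
    joint_cov : array of shape (K*p, K*p), assumed covariance of the stacked estimates.

    Minimizes (t - A theta)' V^{-1} (t - A theta), where t stacks the K
    estimates and A stacks K identity matrices.
    """
    estimates = np.asarray(estimates, dtype=float)
    joint_cov = np.asarray(joint_cov, dtype=float)
    K, p = estimates.shape
    stacked = estimates.reshape(K * p)        # t = (t_1', ..., t_K')'
    A = np.tile(np.eye(p), (K, 1))            # design matrix mapping theta to E[t]
    Vinv_A = np.linalg.solve(joint_cov, A)    # V^{-1} A without forming V^{-1}
    info = A.T @ Vinv_A                       # A' V^{-1} A
    pooled = np.linalg.solve(info, Vinv_A.T @ stacked)
    pooled_cov = np.linalg.inv(info)          # covariance of the pooled estimate
    return pooled, pooled_cov

# Toy usage: two independent scalar estimates with variances 1 and 4.
est = np.array([[1.0], [2.0]])
V = np.diag([1.0, 4.0])
pooled, pooled_cov = pool_min_distance(est, V)
```

With a block-diagonal covariance (independent jobs) this reduces to inverse-variance weighting; a full covariance matrix lets the same formula account for correlation between jobs.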
Pages: 8