Secure discovery of genetic relatives across large-scale and distributed genomic data sets

Cited by: 2
Authors
Hong, Matthew M. [1]
Froelicher, David [1,2]
Magner, Ricky [2]
Popic, Victoria [2]
Berger, Bonnie [1,2,3]
Cho, Hyunghoon [4]
Affiliations
[1] MIT, Comp Sci & Artificial Intelligence Lab, Cambridge, MA 02139 USA
[2] Broad Inst Massachusetts Inst Technol & Harvard, Cambridge, MA 02142 USA
[3] MIT, Dept Math, Cambridge, MA 02139 USA
[4] Yale Univ, Dept Biomed Informat & Data Sci, New Haven, CT 06510 USA
Funding
National Institutes of Health (USA);
Keywords
CRYPTIC RELATEDNESS; ASSOCIATIONS; INFERENCE; MODEL;
DOI
10.1101/gr.279057.124
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification codes
071010 ; 081704 ;
Abstract
Finding relatives within a study cohort is a necessary step in many genomic studies. However, when the cohort is distributed across multiple entities subject to data-sharing restrictions, performing this step often becomes infeasible. Developing a privacy-preserving solution for this task is challenging owing to the burden of estimating kinship between all the pairs of individuals across data sets. We introduce SF-Relate, a practical and secure federated algorithm for identifying genetic relatives across data silos. SF-Relate vastly reduces the number of individual pairs to compare while maintaining accurate detection through a novel locality-sensitive hashing (LSH) approach. We assign individuals who are likely to be related together into buckets and then test relationships only between individuals in matching buckets across parties. To this end, we construct an effective hash function that captures identity-by-descent (IBD) segments in genetic sequences, which, along with a new bucketing strategy, enable accurate and practical private relative detection. To guarantee privacy, we introduce an efficient algorithm based on multiparty homomorphic encryption (MHE) to allow data holders to cooperatively compute the relatedness coefficients between individuals and to further classify their degrees of relatedness, all without sharing any private data. We demonstrate the accuracy and practical runtimes of SF-Relate on the UK Biobank and All of Us data sets. On a data set of 200,000 individuals split between two parties, SF-Relate detects 97% of third-degree or closer relatives within 15 h of runtime. Our work enables secure identification of relatives across large-scale genomic data sets.
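The bucketing idea in the abstract (hash individuals so that likely relatives collide, then compare only pairs in matching buckets across parties) can be illustrated with a toy sketch. This is not SF-Relate's hash function or protocol: the fixed-window key, the `similarity` stand-in, the sample IDs, and the genotype vectors are all hypothetical, and the comparison runs in plaintext, whereas the paper performs it under multiparty homomorphic encryption.

```python
from collections import defaultdict


def bucket_keys(genotype, window=4):
    """One key per non-overlapping window: (start index, alleles in window).

    Hypothetical stand-in for an IBD-aware hash: two samples that share an
    identical window at the same position produce the same key.
    """
    return {
        (start, tuple(genotype[start:start + window]))
        for start in range(0, len(genotype) - window + 1, window)
    }


def build_index(cohort, window=4):
    """Map each bucket key to the set of sample IDs that produced it."""
    index = defaultdict(set)
    for sample_id, genotype in cohort.items():
        for key in bucket_keys(genotype, window):
            index[key].add(sample_id)
    return index


def candidate_pairs(cohort_a, cohort_b, window=4):
    """Cross-party pairs that share at least one bucket key."""
    index_a = build_index(cohort_a, window)
    index_b = build_index(cohort_b, window)
    pairs = set()
    for key in index_a.keys() & index_b.keys():
        for a in index_a[key]:
            for b in index_b[key]:
                pairs.add((a, b))
    return pairs


def similarity(g1, g2):
    """Fraction of matching sites; a placeholder, not a kinship estimator."""
    return sum(x == y for x, y in zip(g1, g2)) / len(g1)


# Made-up toy data: two parties, two samples each, 12 binary sites.
party_a = {
    "a1": [0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0],
    "a2": [1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1],
}
party_b = {
    "b1": [0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1],  # shares first 8 sites with a1
    "b2": [1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0],
}

pairs = candidate_pairs(party_a, party_b)
for a, b in sorted(pairs):
    print(a, b, similarity(party_a[a], party_b[b]))
```

Of the four possible cross-party pairs, only those sharing a bucket key are passed to the comparison step, which is the source of the reduction in work that the abstract describes.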
Pages: 1312-1323
Page count: 12
Related papers
44 in total
  • [1] The "All of Us" Research Program
    Denny J.C.
    Rutter J.L.
    Goldstein D.B.
    Philippakis A.
    Smoller J.W.
    Jenkins G.
    Dishman E.
    [J]. NEW ENGLAND JOURNAL OF MEDICINE, 2019, 381 (07) : 668 - 676
  • [2] Data quality control in genetic case-control association studies
    Anderson, Carl A.
    Pettersson, Fredrik H.
    Clarke, Geraldine M.
    Cardon, Lon R.
    Morris, Andrew P.
    Zondervan, Krina T.
    [J]. NATURE PROTOCOLS, 2010, 5 (09) : 1564 - 1573
  • [3] Population Structure and Cryptic Relatedness in Genetic Association Studies
    Astle, William
    Balding, David J.
    [J]. STATISTICAL SCIENCE, 2009, 24 (04) : 451 - 471
  • [4] 2016, bioRxiv, DOI 10.1101/048181
  • [5] Secure large-scale genome-wide association studies using homomorphic encryption
    Blatt, Marcelo
    Gusev, Alexander
    Polyakov, Yuriy
    Goldwasser, Shafi
    [J]. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2020, 117 (21) : 11608 - 11613
  • [6] On the resemblance and containment of documents
    Broder, AZ
    [J]. COMPRESSION AND COMPLEXITY OF SEQUENCES 1997 - PROCEEDINGS, 1998, : 21 - 29
  • [7] The UK Biobank resource with deep phenotyping and genomic data
    Bycroft, Clare
    Freeman, Colin
    Petkova, Desislava
    Band, Gavin
    Elliott, Lloyd T.
    Sharp, Kevin
    Motyer, Allan
    Vukcevic, Damjan
    Delaneau, Olivier
    O'Connell, Jared
    Cortes, Adrian
    Welsh, Samantha
    Young, Alan
    Effingham, Mark
    McVean, Gil
    Leslie, Stephen
    Allen, Naomi
    Donnelly, Peter
    Marchini, Jonathan
    [J]. NATURE, 2018, 562 (7726) : 203 - +
  • [8] High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios
    Byrska-Bishop, Marta
    Evani, Uday S.
    Zhao, Xuefang
    Basile, Anna O.
    Abel, Haley J.
    Regier, Allison A.
    Corvelo, Andre
    Clarke, Wayne E.
    Musunuri, Rajeeva
    Nagulapalli, Kshithija
    Fairley, Susan
    Runnels, Alexi
    Winterkorn, Lara
    Lowy, Ernesto
    Flicek, Paul
    Germer, Soren
    Brand, Harrison
    Hall, Ira M.
    Talkowski, Michael E.
    Narzisi, Giuseppe
    Zody, Michael C.
    [J]. CELL, 2022, 185 (18) : 3426 - +
  • [9] Second-generation PLINK: rising to the challenge of larger and richer datasets
    Chang, Christopher C.
    Chow, Carson C.
    Tellier, Laurent C. A. M.
    Vattikuti, Shashaank
    Purcell, Shaun M.
    Lee, James J.
    [J]. GIGASCIENCE, 2015, 4
  • [10] Homomorphic Encryption for Arithmetic of Approximate Numbers
    Cheon, Jung Hee
    Kim, Andrey
    Kim, Miran
    Song, Yongsoo
    [J]. ADVANCES IN CRYPTOLOGY - ASIACRYPT 2017, PT I, 2017, 10624 : 409 - 437