Generalized genomic data sharing for differentially private federated learning

Cited by: 8
Authors
Al Aziz, Md Momin [1]
Anjum, Md Monowar [1]
Mohammed, Noman [1]
Jiang, Xiaoqian [2]
Affiliations
[1] Univ Manitoba, Comp Sci, 66 Chancellors Circle, Winnipeg, MB R3T 2N2, Canada
[2] Univ Texas Hlth Sci Ctr Houston, Sch Biomed Informat, 7000 Fannin St, Houston, TX 77030 USA
Funding
Natural Sciences and Engineering Research Council of Canada
Keywords
Differentially private data sharing; Exponential mechanism; Differentially private federated learning; Privacy-preserving machine learning;
DOI
10.1016/j.jbi.2022.104113
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
081203 ; 0835 ;
Abstract
The success of Machine Learning (ML) methods is largely attributed to the quality and quantity of the available data, which may be spread across multiple owners. Federated Learning (FL) over such distributed datasets often provides a reliable way to obtain valuable insight. Genomic data, however, have also proven to be sensitive, which calls for additional safeguards before any sharing or ML operation. We propose a generalized gene expression data sharing method based on a differentially private mechanism. Because the number of genes is large, the data dimension is also reduced to accommodate smaller privacy budgets, and we use an exponential mechanism to create a private histogram from the numeric expression data. The output histogram can be used in any federated machine learning setting with multiple data owners. The proposed solution was submitted to the genomic data security and privacy competition iDASH 2020, where it ranked third among 55 teams. We extended the proposed solution and experimented with two different machine learning algorithms and different settings. The experimental results show that it takes around 8 s to train a model while achieving 0.89 AUC with a privacy budget of only 5. The paper outlines a method for sharing gene expression data for Federated Learning using a privacy-preserving mechanism. The experimental settings and the competition results show the efficacy of the method, which can be further extended to other genomic datasets and machine learning algorithms.
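As a rough illustration of the exponential mechanism mentioned in the abstract, the sketch below privately selects one histogram bin for a single gene's expression values. It is not the authors' implementation: the function names, the bin edges, the use of bin counts as the utility score, and the sample data are all assumptions made for this example. Only the privacy budget of 5 is taken from the abstract.

```python
import math
import random
from typing import List, Optional, Sequence


def exponential_mechanism(utilities: Sequence[float], epsilon: float,
                          sensitivity: float = 1.0,
                          rng: Optional[random.Random] = None) -> int:
    """Sample index i with probability proportional to
    exp(epsilon * utilities[i] / (2 * sensitivity))."""
    rng = rng or random.Random()
    # Shift by the maximum utility for numerical stability;
    # the shift cancels out when the weights are normalized.
    top = max(utilities)
    weights = [math.exp(epsilon * (u - top) / (2.0 * sensitivity))
               for u in utilities]
    return rng.choices(range(len(utilities)), weights=weights, k=1)[0]


def bin_counts(values: Sequence[float], edges: Sequence[float]) -> List[int]:
    """Count values per bin [edges[i], edges[i+1]); the last bin is closed."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    return counts


def private_bin(values: Sequence[float], edges: Sequence[float],
                epsilon: float, rng: Optional[random.Random] = None) -> int:
    """Privately pick a bin, using the bin count as the utility (assumed).
    One individual changes any count by at most 1, so sensitivity is 1."""
    return exponential_mechanism(bin_counts(values, edges), epsilon,
                                 sensitivity=1.0, rng=rng)


# Hypothetical expression values for one gene across samples.
expression = [0.1, 0.2] + [1.5] * 50 + [2.5, 2.9, 3.0]
edges = [0.0, 1.0, 2.0, 3.0]
chosen = private_bin(expression, edges, epsilon=5.0, rng=random.Random(0))
print("selected bin:", chosen)
```

With a small budget the selection becomes close to uniform over the bins, while a larger budget concentrates the probability on the most populated bin. That trade-off is one reason a smaller number of dimensions leaves more budget per released value.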
Pages: 12