Fold-stratified cross-validation for unbiased and privacy-preserving federated learning

Cited by: 26
Authors
Bey, Romain [1, 2]
Goussault, Romain [3]
Grolleau, Francois [1, 2]
Benchoufi, Mehdi [1, 2]
Porcher, Raphael [1, 2]
Affiliations
[1] Univ Paris, Ctr Res Epidemiol & Stat CRESS, French Inst Hlth & Med Res, Natl Inst Agr Res INRA,INSERM, Paris, France
[2] Nantes Univ, Ctr Hosp Univ Nantes, CIC 1413, Ctr Res Cancerol & Immunol Nantes Angers CRCINA,D, Nantes, France
[3] Nantes Univ, Ctr Hosp Univ Nantes, Ctr Res Cancerol & Immunol Nantes Angers CRCINA, Dermatol Dept, Nantes CIC 1413, France
Keywords
federated learning; privacy; validation; duplicated electronic health records; data leakage; ELECTRONIC HEALTH RECORDS;
DOI
10.1093/jamia/ocaa096
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Objective: We introduce fold-stratified cross-validation, a validation methodology that is compatible with privacy-preserving federated learning and that prevents data leakage caused by duplicates of electronic health records (EHRs).
Materials and Methods: Fold-stratified cross-validation complements cross-validation with an initial stratification of EHRs in folds containing patients with similar characteristics, thus ensuring that duplicates of a record are jointly present either in training or in validation folds. Monte Carlo simulations are performed to investigate the properties of fold-stratified cross-validation in the case of a model data analysis using both synthetic data and MIMIC-III (Medical Information Mart for Intensive Care-III) medical records.
Results: In situations in which duplicated EHRs could induce overoptimistic estimations of accuracy, applying fold-stratified cross-validation prevented this bias, while not requiring full deduplication. However, a pessimistic bias might appear if the covariate used for the stratification was strongly associated with the outcome.
Discussion: Although fold-stratified cross-validation presents low computational overhead, to be efficient it requires the preliminary identification of a covariate that is both shared by duplicated records and weakly associated with the outcome. When available, the hash of a personal identifier or a patient's date of birth provides such a covariate. On the contrary, pseudonymization interferes with fold-stratified cross-validation, as it may break the equality of the stratifying covariate among duplicates.
Conclusion: Fold-stratified cross-validation is an easy-to-implement methodology that prevents data leakage when a model is trained on distributed EHRs that contain duplicates, while preserving privacy.
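The fold assignment described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the helper names `fold_of` and `fold_stratified_splits`, the choice of SHA-256, the record layout, and the use of date of birth (`dob`) as the stratifying covariate are all assumptions made for illustration. The idea shown is only that each site can derive the fold from a covariate shared by duplicates, so copies of a record held at different sites fall into the same fold without the records being compared centrally.

```python
import hashlib


def fold_of(covariate: str, n_folds: int = 5) -> int:
    """Map a stratifying covariate value to a fold index deterministically.

    Hypothetical helper: any hash shared by all sites would serve here.
    """
    digest = hashlib.sha256(covariate.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_folds


def fold_stratified_splits(records, key, n_folds=5):
    """Yield (fold_index, train, validation) with folds defined by fold_of."""
    folds = [[] for _ in range(n_folds)]
    for rec in records:
        folds[fold_of(str(rec[key]), n_folds)].append(rec)
    for i in range(n_folds):
        validation = folds[i]
        # Training data is every record whose fold differs from fold i.
        train = [r for j, f in enumerate(folds) if j != i for r in f]
        yield i, train, validation


# Illustrative records (made up): the first two are duplicates of one
# patient held by two different sites and share the same date of birth.
records = [
    {"site": "A", "dob": "1950-03-14", "outcome": 1},
    {"site": "B", "dob": "1950-03-14", "outcome": 1},
    {"site": "A", "dob": "1962-11-02", "outcome": 0},
    {"site": "B", "dob": "1978-07-23", "outcome": 0},
]

for i, train, validation in fold_stratified_splits(records, "dob"):
    # Records sharing a date of birth never straddle the train/validation split.
    shared = {r["dob"] for r in train} & {r["dob"] for r in validation}
    assert not shared
```

Because the fold depends only on the covariate value, unrelated patients with the same date of birth are also grouped together; as the abstract notes, the covariate should therefore be weakly associated with the outcome.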
Pages: 1244-1251
Number of pages: 8