Protecting Privacy in Large Datasets - First We Assess the Risk; Then We Fuzzy the Data

Cited by: 14
Authors
Ursin, Giske [1, 2, 3]
Sen, Sagar [1, 4]
Mottu, Jean-Marie [5]
Nygard, Mari [1]
Affiliations
[1] Canc Registry Norway, POB 5313 Majorstuen, N-0304 Oslo, Norway
[2] Univ Oslo, Inst Basic Med Sci, Oslo, Norway
[3] Univ Southern Calif, Dept Prevent Med, Los Angeles, CA USA
[4] Simula Res Lab, Lysaker, Norway
[5] Univ Nantes, AtlanModels Team, INRIA, IMT A, LS2N, Nantes, France
DOI
10.1158/1055-9965.EPI-17-0172
Chinese Library Classification (CLC) number
R73 [Oncology];
Discipline classification code
100214;
Abstract
Background: Privacy of information is an increasing concern with the availability of large amounts of data from many individuals. Even when access to data is heavily controlled, and the data shared with researchers contain no personal identifying information, there is a possibility of reidentifying individuals. To avoid reidentification, several anonymization protocols are available. These include categorizing variables into broader categories to ensure more than one individual in each category, such as k-anonymization, as well as protocols aimed at adding noise to the data. However, data custodians rarely assess reidentification risks.
Methods: We assessed the reidentification risk of a large realistic dataset based on screening data from over 5 million records on 0.9 million women in the Norwegian Cervical Cancer Screening Program, before and after we used old and new techniques of adding noise to (fuzzifying) the data.
Results: Categorizing date variables (applying k-anonymization) substantially reduced the possibility of reidentification of individuals. Adding a random factor, such as the fuzzy factor used here, makes it even more difficult to reidentify specific individuals.
Conclusions: Our results show that simple techniques can substantially reduce the risk of reidentification.
Impact: Registry owners and large-scale data custodians should consider estimating and, if necessary, reducing reidentification risks before sharing large datasets. ©2017 AACR.
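The workflow described in the abstract (assess the risk, then coarsen and fuzzy the dates) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy records, the choice of quasi-identifiers (birth year and screening date), the `risk_summary` measure, and the uniform random day shift in `fuzz_date` are all assumptions made for illustration, since the abstract does not specify how the fuzzy factor is computed.

```python
import random
from collections import Counter
from datetime import date, timedelta

# Hypothetical toy records: (birth_year, screening_date). Not real registry data.
records = [
    (1970, date(2010, 3, 1)),
    (1970, date(2010, 3, 15)),
    (1970, date(2010, 3, 28)),
    (1980, date(2011, 6, 2)),
    (1980, date(2011, 6, 20)),
]

def exact_key(record):
    """Quasi-identifier that uses the exact screening date."""
    birth_year, screening_date = record
    return (birth_year, screening_date)

def month_key(record):
    """Generalized quasi-identifier: the date is coarsened to year and month."""
    birth_year, screening_date = record
    return (birth_year, screening_date.year, screening_date.month)

def risk_summary(rows, key):
    """Return the smallest group size (k) and the share of records unique on the key."""
    sizes = Counter(key(r) for r in rows)
    unique = sum(1 for count in sizes.values() if count == 1)
    return {"k": min(sizes.values()), "unique_fraction": unique / len(rows)}

def fuzz_date(d, max_shift_days, rng):
    """Shift a date by a random whole number of days in [-max_shift_days, max_shift_days]."""
    return d + timedelta(days=rng.randint(-max_shift_days, max_shift_days))

# Risk before and after categorizing the date variable.
before = risk_summary(records, exact_key)
after = risk_summary(records, month_key)

# Assumed form of fuzzification: an independent random shift per record.
rng = random.Random(0)  # fixed seed only so the sketch is repeatable
fuzzed = [(year, fuzz_date(d, 30, rng)) for year, d in records]
```

On these toy rows, every record is unique when the exact date is used, whereas grouping by month leaves no group smaller than two records. A real assessment would use the full set of quasi-identifiers available to an attacker and would need to weigh the added noise against the analytic value of the dates.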
Pages: 1219-1224
Page count: 6