Composite Bloom Filters for Secure Record Linkage

被引:49
作者
Durham, Elizabeth A. [1 ]
Kantarcioglu, Murat [2 ]
Xue, Yuan [3 ]
Toth, Csaba [1 ]
Kuzu, Mehmet [2 ]
Malin, Bradley [4 ,5 ]
机构
[1] Vanderbilt Univ, Dept Biomed Informat, Nashville, TN 37232 USA
[2] Univ Texas Dallas, Dept Comp Sci, Dallas, TX 75083 USA
[3] Vanderbilt Univ, Dept Elect Engn & Comp Sci, Nashville, TN 37232 USA
[4] Vanderbilt Univ, Dept Biomed Informat, Nashville, TN 37232 USA
[5] Vanderbilt Univ, EECS, Nashville, TN 37232 USA
基金
美国国家科学基金会;
关键词
Data matching; record linkage; entity resolution; privacy; security; Bloom filter; STRING COMPARATORS;
D O I
10.1109/TKDE.2013.91
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
The process of record linkage seeks to integrate instances that correspond to the same entity. Record linkage has traditionally been performed through the comparison of identifying field values (e.g., Surname), however, when databases are maintained by disparate organizations, the disclosure of such information can breach the privacy of the corresponding individuals. Various private record linkage (PRL) methods have been developed to obscure such identifiers, but they vary widely in their ability to balance competing goals of accuracy, efficiency and security. The tokenization and hashing of field values into Bloom filters (BF) enables greater linkage accuracy and efficiency than other PRL methods, but the encodings may be compromised through frequency-based cryptanalysis. Our objective is to adapt a BF encoding technique to mitigate such attacks with minimal sacrifices in accuracy and efficiency. To accomplish these goals, we introduce a statistically-informed method to generate BF encodings that integrate bits from multiple fields, the frequencies of which are provably associated with a minimum number of fields. Our method enables a user-specified tradeoff between security and accuracy. We compare our encoding method with other techniques using a public dataset of voter registration records and demonstrate that the increases in security come with only minor losses to accuracy.
引用
收藏
页码:2956 / 2968
页数:13
相关论文
共 49 条
[1]  
Agrawal R., P 2004 ACM SIGMOD IN, P873
[2]  
Agrawal Rakesh, 2003, P 2003 ACM SIGMOD IN, P86, DOI DOI 10.1145/872757.872771
[3]  
Al-Lawati A., 2005, PROC ACM SIGMOD WORK, P59, DOI DOI 10.1145/1077501.1077513
[4]  
[Anonymous], 2010, Proceedings of the 13th International Conference on Extending Database Technology, EDBT'10, DOI [10.1145/1739041.1739059, DOI 10.1145/1739041.1739059]
[5]  
[Anonymous], 2004, Proc. of the ACM SIGMOD Workshop on Data Mining and Knowledge Discovery, DOI DOI 10.1145/1008694.1008697
[6]  
[Anonymous], FED INF PROC STAND P
[7]  
[Anonymous], 2012, DATA MATCHING CONCEP, DOI DOI 10.1007/978-3-642-31164-2
[8]  
[Anonymous], 2003, IIWeb
[9]  
Atallah M.J., 2003, ACM WPES 2003, P39
[10]  
Berman JJ, 2004, ARCH PATHOL LAB MED, V128, P344