Robust Record Linkage Blocking Using Suffix Arrays and Bloom Filters

被引:16
作者
De Vries, Timothy [1 ]
Ke, Hui [1 ]
Chawla, Sanjay [1 ]
Christen, Peter [2 ]
机构
[1] Univ Sydney, Sch Informat Technol, Sydney, NSW 2006, Australia
[2] Australian Natl Univ, Coll Engn & Comp Sci, Canberra, ACT 0200, Australia
关键词
Record linkage; blocking; suffix arrays;
D O I
10.1145/1921632.1921635
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
Record linkage is an important data integration task that has many practical uses for matching, merging and duplicate removal in large and diverse databases. However, quadratic scalability for the brute force approach of comparing all possible pairs of records necessitates the design of appropriate indexing or blocking techniques. The aim of these techniques is to cheaply remove candidate record pairs that are unlikely to match. We design and evaluate an efficient and highly scalable blocking approach based on suffix arrays. Our suffix grouping technique exploits the ordering used by the index to merge similar blocks at marginal extra cost, resulting in a much higher accuracy while retaining the high scalability of the base suffix array method. Efficiently grouping similar suffixes is carried out with the use of a sliding window technique. We carry out an in-depth analysis of our method and show results from experiments using real and synthetic data, which highlight the importance of using efficient indexing and blocking in real-world applications where datasets contain millions of records. We extend our disk-based methods with the capability to utilise main memory based storage to construct Bloom filters, which we have found to cause significant speedup by reducing the number of costly database queries by up to 70% in real data. We give practical implementation details and show how Bloom filters can be easily applied to Suffix Array based indexing.
引用
收藏
页数:27
相关论文
共 28 条
[1]   A fast linkage detection scheme for multi-source information integration [J].
Aizawa, A ;
Oyama, K .
INTERNATIONAL WORKSHOP ON CHALLENGES IN WEB INFORMATION RETRIEVAL AND INTEGRATION, PROCEEDINGS, 2005, :30-39
[2]  
[Anonymous], RR200602 US BUR CENS
[3]  
Bawa M., 2003, PROC 29 INT C VERY L, P922
[4]  
Baxter Rohan, 2003, P WORKSH DAT CLEAN R
[5]   SPACE/TIME TRADE/OFFS IN HASH CODING WITH ALLOWABLE ERRORS [J].
BLOOM, BH .
COMMUNICATIONS OF THE ACM, 1970, 13 (07) :422-&
[6]  
Broder Andrei, 2002, Internet mathematics, P636, DOI DOI 10.1080/15427951.2004.10129096
[7]  
Chazelle Bernard, 2004, P 15 ANN ACM SIAM S, P30
[8]  
Christen P., 2007, TRCS0703 AUSTR NAT U
[9]  
Christen P., 2006, P WORKSH MIN COMPL D
[10]  
Christen Peter., 2008, Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, P1065, DOI DOI 10.1145/1401890.1402020