Landmarks-based Blocking Method For Large-scale Entity Resolution

Times cited: 0
Authors
Herath, Samudra [1]
Roughan, Matthew [1]
Glonek, Gary [2]
Affiliations
[1] Univ Adelaide, ARC Ctr Excellence Math & Stat Frontiers, Adelaide, SA, Australia
[2] Univ Adelaide, Sch Math Sci, Adelaide, SA, Australia
Source
2020 IEEE 7TH INTERNATIONAL CONFERENCE ON DATA SCIENCE AND ADVANCED ANALYTICS (DSAA 2020) | 2020
Keywords
Entity resolution; record linkage; data matching; multidimensional scaling; KD-trees; Nearest-Neighbour search
DOI
10.1109/DSAA49011.2020.00110
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Large-scale entity resolution (ER) techniques have received tremendous attention due to the emergence of data processing within organizations and governments. The traditional ER process requires pairwise comparisons between all records to identify those that belong to the same entity, which is computationally prohibitive for large databases. Although many indexing techniques exist to address this issue, it remains an open research question. We propose a landmarks-based indexing algorithm that reduces the number of pairwise comparisons between non-matching records. Blocks are determined from pre-selected records, called landmarks, in a multidimensional Euclidean space. Restricting pairwise comparisons to records within the same block reduces the search space immensely. Our method is scalable for big data entity resolution as it has O(n) insertion and query complexity.
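
The blocking idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the records have already been embedded as points in a Euclidean space (the keywords suggest multidimensional scaling for this step), it assumes each record goes to the block of its single nearest landmark, and it uses a brute-force nearest-landmark search where the keywords suggest KD-trees. The function names `assign_blocks` and `candidate_pairs` and the toy data are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations


def assign_blocks(points, landmarks):
    """Group point indices by the index of their nearest landmark.

    Assumption: one block per landmark, one block per record.
    Brute-force search is used here for clarity only.
    """
    blocks = defaultdict(list)
    for i, p in enumerate(points):
        nearest = min(range(len(landmarks)),
                      key=lambda j: math.dist(p, landmarks[j]))
        blocks[nearest].append(i)
    return dict(blocks)


def candidate_pairs(blocks):
    """Return the record pairs that share a block."""
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(members, 2))
    return pairs


# Toy data (hypothetical): two tight clusters of embedded records in 2-D.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
landmarks = [points[0], points[3]]  # pre-selected records used as landmarks

blocks = assign_blocks(points, landmarks)
pairs = candidate_pairs(blocks)
all_pairs = len(points) * (len(points) - 1) // 2
print(len(pairs), "candidate pairs instead of", all_pairs)
```

In this toy case only pairs inside each cluster are compared, so cross-cluster non-matches are never examined.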
Pages: 773-774
Number of pages: 2
Related papers
3 items in total
[1] Arya, S.; Mount, D. M.; Netanyahu, N. S.; Silverman, R.; Wu, A. Y. An optimal algorithm for approximate nearest neighbor searching in fixed dimensions [J]. Journal of the ACM, 1998, 45(6): 891-923.
[3] Kruskal, J. B. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis [J]. Psychometrika, 1964, 29(1): 1-27.