An analysis of one-to-one matching algorithms for entity resolution

被引:0
作者
George Papadakis
Vasilis Efthymiou
Emmanouil Thanos
Oktie Hassanzadeh
Peter Christen
机构
[1] National and Kapodistrian University of Athens,
[2] Foundation for Research and Technology - Hellas,undefined
[3] KU Leuven,undefined
[4] IBM Research,undefined
[5] Yorktown Heights,undefined
[6] Australian National University,undefined
来源
The VLDB Journal | 2023年 / 32卷
关键词
Bipartite graphs; Graph matching; Clustering; Experimental evaluation; Record linkage;
D O I
暂无
中图分类号
学科分类号
摘要
Entity resolution (ER) is the task of finding records that refer to the same real-world entities. A common scenario, which we refer to as Clean-Clean ER, is to resolve records across two clean sources (i.e., they are duplicate-free and contain one record per entity). Matching algorithms for Clean-Clean ER yield bipartite graphs, which are further processed by clustering algorithms to produce the end result. In this paper, we perform an extensive empirical evaluation of eight bipartite graph matching algorithms that take as input a bipartite similarity graph and provide as output a set of matched records. We consider a wide range of matching algorithms, including algorithms that have not previously been applied to ER, or have been evaluated only in other ER settings. We assess the relative performance of these algorithms with respect to accuracy and time efficiency over ten established real-world data sets, from which we generated over 700 different similarity graphs. Our results provide insights into the relative performance of these algorithms and guidelines for choosing the best one, depending on the data at hand.
引用
收藏
页码:1369 / 1400
页数:31
相关论文
共 92 条
[1]  
Aumüller M(2020)Ann-benchmarks: a benchmarking tool for approximate nearest neighbor algorithms Inf. Syst. 87 eabi8021-146
[2]  
Bernhardsson E(2022)(Almost) all of entity resolution Sci. Adv. 8 135-127:42
[3]  
Faithfull AJ(2017)Enriching word vectors with subword information Trans. Assoc. Comput. Linguist. 5 127:1-30
[4]  
Binette O(2021)An overview of end-to-end entity resolution for big data ACM Comput. Surv. 53 1-3:30
[5]  
Steorts RC(2006)Statistical comparisons of classifiers over multiple data sets J. Mach. Learn. Res. 7 3:1-1210
[6]  
Bojanowski P(2020)Transforming pairwise duplicates to entity clusters for high-quality duplicate detection ACM J. Data Inf. Qual. 12 1183-615
[7]  
Grave E(1969)A theory for record linkage J. Am. Stat. Assoc. 64 596-15
[8]  
Joulin A(1987)Fibonacci heaps and their uses in improved network optimization algorithms J. ACM 34 9-5:39
[9]  
Mikolov T(1962)College admissions and the stability of marriage Am. Math. Mon. 69 5:1-708
[10]  
Christophides V(2008)Summarization system evaluation revisited: N-gram graphs ACM Trans. Speech Lang. Process. 5 705-456