An analysis of one-to-one matching algorithms for entity resolution

被引:4
作者
Papadakis, George [1 ]
Efthymiou, Vasilis [2 ]
Thanos, Emmanouil [3 ]
Hassanzadeh, Oktie [4 ]
Christen, Peter [5 ]
机构
[1] Natl & Kapodistrian Univ Athens, Athens, Greece
[2] Fdn Res & Technol Hellas, Iraklion, Greece
[3] Katholieke Univ Leuven, Leuven, Belgium
[4] IBM Res, Yorktown Hts, NY USA
[5] Australian Natl Univ, Canberra, Australia
关键词
Bipartite graphs; Graph matching; Clustering; Experimental evaluation; Record linkage;
D O I
10.1007/s00778-023-00791-3
中图分类号
TP3 [计算技术、计算机技术];
学科分类号
0812 ;
摘要
Entity resolution (ER) is the task of finding records that refer to the same real-world entities. A common scenario, which we refer to as Clean-Clean ER, is to resolve records across two clean sources (i.e., they are duplicate-free and contain one record per entity). Matching algorithms for Clean-Clean ER yield bipartite graphs, which are further processed by clustering algorithms to produce the end result. In this paper, we perform an extensive empirical evaluation of eight bipartite graph matching algorithms that take as input a bipartite similarity graph and provide as output a set of matched records. We consider a wide range of matching algorithms, including algorithms that have not previously been applied to ER, or have been evaluated only in other ER settings. We assess the relative performance of these algorithms with respect to accuracy and time efficiency over ten established real-world data sets, from which we generated over 700 different similarity graphs. Our results provide insights into the relative performance of these algorithms and guidelines for choosing the best one, depending on the data at hand.
引用
收藏
页码:1369 / 1400
页数:32
相关论文
共 63 条
  • [1] Assi A, 2019, IEEE INT CONF BIG DA, P1028, DOI 10.1109/BigData47090.2019.9006348
  • [2] ANN-Benchmarks: A benchmarking tool for approximate nearest neighbor algorithms
    Aumuller, Martin
    Bernhardsson, Erik
    Faithfull, Alexander
    [J]. INFORMATION SYSTEMS, 2020, 87
  • [3] (Almost) all of entity resolution
    Binette, Olivier
    Steorts, Rebecca C.
    [J]. SCIENCE ADVANCES, 2022, 8 (12)
  • [4] Bojanowski P., 2017, T ASSOC COMPUT LING, V5, P135, DOI [DOI 10.1162/TACLA00051, 10.1162/tacl\_a\_00051, 10.1162/tacl_a_00051, DOI 10.1162/TACL_A_00051]
  • [5] Brunner U., 2020, P EDBT 2020, P463
  • [6] Chapman S., 2007, SIMMETRICS OPEN SOUR
  • [7] Christen P., 2012, DATA MATCHING CONCEP, DOI [DOI 10.1007/978-3-642-31164-2, 10.1007/978-3-642-31164-2]
  • [8] Christen Peter, 2020, Linking Sensitive Data, DOI DOI 10.1007/978-3-030-59706-1
  • [9] An Overview of End-to-End Entity Resolution for Big Data
    Christophides, Vassilis
    Efthymiou, Vasilis
    Palpanas, Themis
    Papadakis, George
    Stefanidis, Kostas
    [J]. ACM COMPUTING SURVEYS, 2021, 53 (06)
  • [10] Dasgupta Sanjoy, 2008, Algorithms