FDup: a framework for general-purpose and efficient entity deduplication of record collections

Cited by: 2
Authors
De Bonis, Michele [1]
Manghi, Paolo [1]
Atzori, Claudio [1]
Affiliations
[1] CNR, Ist Sci & Tecnol Informaz A Faedo ISTI, Pisa, Italy
Keywords
Deduplication; Scholarly communication
DOI
10.7717/peerj-cs.1058
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Deduplication is a technique aiming at identifying and resolving duplicate metadata records in a collection. This article describes FDup (Flat Collections Deduper), a general-purpose software framework supporting a complete deduplication workflow to manage big data record collections: metadata record data model definition, identification of candidate duplicates, and identification of duplicates. FDup brings two main innovations. First, it delivers a full deduplication framework in a single easy-to-use software package based on the Apache Spark Hadoop framework, where developers can customize the optimal and parallel workflow steps of blocking, sliding windows, and similarity matching function via an intuitive configuration file. Second, it introduces a novel approach to improve performance, beyond the known techniques of "blocking" and "sliding window", by introducing a smart similarity matching function, T-match. T-match is engineered as a decision tree that drives the comparisons of the fields of two records as branches of predicates and allows for successful or unsuccessful early-exit strategies. The efficacy of the approach is demonstrated by experiments performed over big data collections of metadata records in the OpenAIRE Research Graph, a known open access knowledge base in scholarly communication.
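The three steps named in the abstract (blocking, sliding window, and a decision-tree match with early exits) can be illustrated with a minimal single-machine sketch. This is not FDup's code, configuration format, or API: the function names, the record fields (`id`, `pid`, `year`, `title`), the blocking key, the window size, and the 0.9 similarity threshold are all assumptions chosen for illustration, and the Spark parallelism of the real framework is left out.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def blocking_key(record):
    # Hypothetical blocking key: first four alphanumeric characters of the title.
    normalized = "".join(ch for ch in record["title"].lower() if ch.isalnum())
    return normalized[:4]


def t_match(a, b):
    """Toy decision tree over record fields with early exits (illustrative only)."""
    # Node 1: identical non-empty identifiers give a successful early exit.
    if a.get("pid") and a.get("pid") == b.get("pid"):
        return True
    # Node 2: different publication years give an unsuccessful early exit.
    if a.get("year") != b.get("year"):
        return False
    # Node 3: the costlier title comparison runs only when earlier nodes are inconclusive.
    ratio = SequenceMatcher(None, a["title"].lower(), b["title"].lower()).ratio()
    return ratio >= 0.9  # assumed threshold


def find_duplicate_pairs(records, window=3):
    # Blocking: group records so that only records sharing a key are compared.
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)

    pairs = set()
    for block in blocks.values():
        block.sort(key=lambda r: r["title"].lower())
        for i, left in enumerate(block):
            # Sliding window: compare each record with the next (window - 1) records only.
            for right in block[i + 1 : i + window]:
                if t_match(left, right):
                    pairs.add(tuple(sorted((left["id"], right["id"]))))
    return pairs


if __name__ == "__main__":
    sample = [
        {"id": "r1", "pid": "10.1/x", "year": 2022, "title": "FDup: a framework for entity deduplication"},
        {"id": "r2", "pid": None, "year": 2022, "title": "FDup: A Framework for Entity Deduplication"},
        {"id": "r3", "pid": None, "year": 2021, "title": "FDup: a framework for entity resolution"},
    ]
    print(find_duplicate_pairs(sample))
```

Ordering the cheap, decisive predicates before the string comparison is what lets most pairs exit early in this sketch; which fields and predicates belong at each node would depend on the record model in use.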
Pages: 23
Source
PeerJ Computer Science, 2022, 8