FDup: a framework for general-purpose and efficient entity deduplication of record collections

Cited by: 2
Authors
De Bonis, Michele [1]
Manghi, Paolo [1]
Atzori, Claudio [1]
Affiliation
[1] CNR, Ist Sci & Tecnol Informaz A Faedo ISTI, Pisa, Italy
Keywords
Deduplication; Scholarly communication
DOI
10.7717/peerj-cs.1058
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Deduplication is a technique aimed at identifying and resolving duplicate metadata records in a collection. This article describes FDup (Flat Collections Deduper), a general-purpose software framework supporting a complete deduplication workflow to manage big data record collections: metadata record data model definition, identification of candidate duplicates, and identification of duplicates. FDup brings two main innovations. First, it delivers a full deduplication framework in a single easy-to-use software package based on the Apache Spark Hadoop framework, where developers can customize the optimal and parallel workflow steps of blocking, sliding windows, and similarity matching function via an intuitive configuration file. Second, it introduces a novel approach to improve performance, beyond the known techniques of "blocking" and "sliding window", by introducing a smart similarity matching function, T-match. T-match is engineered as a decision tree that drives the comparisons of the fields of two records as branches of predicates and allows for successful or unsuccessful early-exit strategies. The efficacy of the approach is demonstrated by experiments performed over big data collections of metadata records in the OpenAIRE Research Graph, a known open access knowledge base in scholarly communication.
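The workflow named in the abstract (blocking, sliding window, then a decision-tree match with early exits) can be illustrated with a minimal Python sketch. It is not FDup's code or configuration format: the record fields (`title`, `year`, `doi`), the blocking key, the 0.9 title threshold, and every function name below are assumptions made only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Dict, Iterable, Iterator, List, Tuple, Union

Record = Dict[str, str]  # hypothetical flat record: field name -> value


@dataclass
class Node:
    """Decision-tree node: a predicate and the branch taken on each outcome.

    A branch is either another Node or a bool leaf (True = duplicates,
    False = not duplicates), which ends the comparison early.
    """
    predicate: Callable[[Record, Record], bool]
    on_true: Union["Node", bool]
    on_false: Union["Node", bool]


def tree_match(a: Record, b: Record, node: Union[Node, bool]) -> bool:
    """Walk the tree until a bool leaf is reached; later fields are never read."""
    while not isinstance(node, bool):
        node = node.on_true if node.predicate(a, b) else node.on_false
    return node


# Hypothetical predicates over the assumed fields.
def both_have_doi(a: Record, b: Record) -> bool:
    return bool(a.get("doi")) and bool(b.get("doi"))


def same_doi(a: Record, b: Record) -> bool:
    return a["doi"].lower() == b["doi"].lower()


def similar_title(a: Record, b: Record) -> bool:
    ratio = SequenceMatcher(None, a["title"].lower(), b["title"].lower()).ratio()
    return ratio >= 0.9  # assumed threshold


def same_year(a: Record, b: Record) -> bool:
    return a.get("year") == b.get("year")


# Assumed tree: a DOI comparison decides immediately when both records carry
# a DOI (early success or early failure); otherwise fall back to title, year.
TREE = Node(
    both_have_doi,
    on_true=Node(same_doi, on_true=True, on_false=False),
    on_false=Node(similar_title,
                  on_true=Node(same_year, on_true=True, on_false=False),
                  on_false=False),
)


def blocking_key(r: Record) -> str:
    """Assumed blocking key: first three characters of the lowercased title."""
    return r["title"].lower()[:3]


def candidate_pairs(records: Iterable[Record],
                    window: int) -> Iterator[Tuple[Record, Record]]:
    """Blocking plus sliding window: within each block, sort the records and
    pair each one only with the next `window - 1` records."""
    blocks: Dict[str, List[Record]] = defaultdict(list)
    for r in records:
        blocks[blocking_key(r)].append(r)
    for block in blocks.values():
        block.sort(key=lambda r: r["title"].lower())
        for i, a in enumerate(block):
            for b in block[i + 1:i + window]:
                yield a, b


def find_duplicates(records: Iterable[Record],
                    window: int = 3) -> List[Tuple[Record, Record]]:
    """Run the tree match only on the candidate pairs."""
    return [(a, b) for a, b in candidate_pairs(records, window)
            if tree_match(a, b, TREE)]
```

In this sketch the saving comes from two places: `candidate_pairs` limits how many pairs are compared at all, and `tree_match` stops at the first leaf it reaches, so a pair with conflicting DOIs never triggers the more expensive title comparison.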
Pages: 23