FDup: a framework for general-purpose and efficient entity deduplication of record collections

Authors
De Bonis M. [1]
Manghi P. [1]
Atzori C. [1]
Affiliations
[1] Istituto di Scienza e Tecnologie dell'Informazione “A. Faedo” (ISTI), Consiglio Nazionale delle Ricerche (CNR), Pisa
Funding
EU Horizon 2020
Keywords
Deduplication; Scholarly communication
DOI
10.7717/PEERJ-CS.1058
Abstract
Deduplication is the task of identifying and resolving duplicate metadata records in a collection. This article describes FDup (Flat Collections Deduper), a general-purpose software framework that supports a complete deduplication workflow for big data record collections: definition of the metadata record data model, identification of candidate duplicates, and identification of duplicates. FDup brings two main innovations. First, it delivers a full deduplication framework in a single easy-to-use software package based on the Apache Spark Hadoop framework, where developers can customize the optimal and parallel workflow steps of blocking, sliding windows, and similarity matching via an intuitive configuration file. Second, it improves performance beyond the known techniques of "blocking" and "sliding window" by introducing a smart similarity matching function, T-match. T-match is engineered as a decision tree that drives the comparison of the fields of two records as branches of predicates and allows early exit on either a successful or an unsuccessful match. The efficacy of the approach is demonstrated by experiments over big data collections of metadata records in the OpenAIRE Research Graph, a well-known open access knowledge base in scholarly communication. © Copyright 2022 De Bonis et al.
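The three steps named in the abstract (blocking, sliding window, and a decision-tree matcher with early exits) can be illustrated with a minimal single-machine sketch. This is not FDup's code, API, or configuration format: every name below (`Node`, `TREE`, `tree_match`, `candidate_pairs`, the record fields, the 0.8 title threshold) is a hypothetical choice made for illustration, and the sketch runs in plain Python rather than on Apache Spark.

```python
import re
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List, Tuple, Union

Record = Dict[str, object]  # hypothetical flat record: field name -> value


@dataclass
class Node:
    """One predicate; each branch is a final verdict (bool) or the next node's name."""
    predicate: Callable[[Record, Record], bool]
    if_true: Union[bool, str]
    if_false: Union[bool, str]


def same_doi(a: Record, b: Record) -> bool:
    return bool(a.get("doi")) and a.get("doi") == b.get("doi")


def same_year(a: Record, b: Record) -> bool:
    return a.get("year") == b.get("year")


def similar_title(a: Record, b: Record) -> bool:
    # Jaccard similarity on lowercase word tokens; 0.8 is an assumed threshold.
    ta = set(re.findall(r"\w+", str(a.get("title", "")).lower()))
    tb = set(re.findall(r"\w+", str(b.get("title", "")).lower()))
    return bool(ta | tb) and len(ta & tb) / len(ta | tb) >= 0.8


# Assumed tree: an equal DOI exits early with a match, a different year
# exits early with a non-match, and only then is the costlier title check run.
TREE = {
    "doi": Node(same_doi, if_true=True, if_false="year"),
    "year": Node(same_year, if_true="title", if_false=False),
    "title": Node(similar_title, if_true=True, if_false=False),
}


def tree_match(a: Record, b: Record, tree: Dict[str, Node], root: str = "doi") -> bool:
    """Walk the tree from root and stop at the first boolean verdict."""
    name = root
    while True:
        node = tree[name]
        outcome = node.if_true if node.predicate(a, b) else node.if_false
        if isinstance(outcome, bool):
            return outcome
        name = outcome


def candidate_pairs(records: List[Record], window: int) -> Iterator[Tuple[Record, Record]]:
    """Blocking on a title prefix, then a sliding window inside each sorted block."""
    blocks = defaultdict(list)
    for r in records:
        blocks[str(r["title"]).lower()[:4]].append(r)
    for block in blocks.values():
        block.sort(key=lambda r: str(r["title"]).lower())
        for i, a in enumerate(block):
            # Each record is compared only with the next (window - 1) records.
            for b in block[i + 1:i + window]:
                yield a, b


def find_duplicates(records: List[Record], window: int = 3) -> List[Tuple[int, int]]:
    pairs = []
    for a, b in candidate_pairs(records, window):
        if tree_match(a, b, TREE):
            pairs.append((min(a["id"], b["id"]), max(a["id"], b["id"])))
    return sorted(pairs)


RECORDS = [
    {"id": 1, "doi": "10.1/x", "title": "FDup: a framework for entity deduplication", "year": 2022},
    {"id": 2, "doi": "10.1/x", "title": "FDup a framework for entity deduplication.", "year": 2022},
    {"id": 3, "doi": "", "title": "FDup: a framework for entity deduplication", "year": 2022},
    {"id": 4, "doi": "", "title": "Plasduino: a data acquisition framework", "year": 2014},
]

DUPLICATES = find_duplicates(RECORDS)
```

In this sketch the cheap, decisive predicates sit near the root so that most candidate pairs are resolved before the token-based title comparison runs. That ordering is the intuition behind the early-exit strategy the abstract attributes to T-match, but the actual predicates and tree layout used by FDup are not given in this record.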