BigDedup: A Big Data Integration Toolkit for Duplicate Detection in Industrial Scenarios

Cited by: 1
Authors
Gagliardelli, Luca [1]
Zhu, Song [1]
Simonini, Giovanni [1]
Bergamaschi, Sonia [1]
Affiliations
[1] Univ Modena & Reggio Emilia, Modena, Italy
Source
TRANSDISCIPLINARY ENGINEERING METHODS FOR SOCIAL INNOVATION OF INDUSTRY 4.0 | 2018, Vol. 7
Keywords
Duplicate detection; Entity Resolution; Data Integration; Record Linkage; Big Data; Meta-blocking
DOI
10.3233/978-1-61499-898-3-1015
Chinese Library Classification
T [Industrial Technology]
Subject classification code
08
Abstract
Duplicate detection aims to identify different records in data sources that refer to the same real-world entity. It is a fundamental task for item catalog fusion, customer database integration, fraud detection, and more. In this work we present BigDedup, a toolkit that detects duplicate records in Big Data sources efficiently. BigDedup makes state-of-the-art duplicate detection techniques available on Apache Spark, a modern framework for distributed computing in Big Data scenarios. It can be used in two ways: (i) through a simple graphical interface that lets the user process structured and unstructured data quickly and effectively; (ii) as a library whose components can be easily extended and customized. In the paper we show how to use BigDedup and demonstrate its usefulness through some industrial examples.
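To make the task described in the abstract concrete, the following is a minimal single-machine sketch of duplicate detection using token blocking followed by Jaccard similarity. It is not BigDedup's API and does not use Apache Spark; the function names (`tokens`, `candidate_pairs`, `jaccard`, `find_duplicates`), the threshold value, and the sample records are all assumptions made for illustration. Plain token blocking stands in here for the more refined techniques (such as meta-blocking) named in the keywords.

```python
from collections import defaultdict
from itertools import combinations


def tokens(record):
    """Lowercased word tokens from all attribute values of a record.

    Attribute names are ignored, so records with different schemas
    (or no schema at all) can still be compared.
    """
    return {tok for value in record.values() for tok in str(value).lower().split()}


def candidate_pairs(records):
    """Token blocking: records sharing at least one token form a candidate pair."""
    blocks = defaultdict(set)
    for rid, record in records.items():
        for tok in tokens(record):
            blocks[tok].add(rid)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_duplicates(records, threshold=0.5):
    """Compare only the candidate pairs and keep those above the threshold."""
    token_sets = {rid: tokens(rec) for rid, rec in records.items()}
    return sorted(
        (i, j)
        for i, j in candidate_pairs(records)
        if jaccard(token_sets[i], token_sets[j]) >= threshold
    )


# Hypothetical product records with heterogeneous attribute names.
records = {
    1: {"name": "Acme Steel Bolt M8", "vendor": "Acme"},
    2: {"title": "acme bolt m8 steel"},
    3: {"name": "Garden Hose 20m", "vendor": "Flexi"},
}
print(find_duplicates(records))
```

Blocking avoids comparing every record with every other one: record 3 shares no token with records 1 or 2, so it is never compared against them. In a distributed setting the blocks would be built and processed in parallel, which this sketch does not attempt.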
Pages: 1015-1023
Page count: 9