BigDedup: A Big Data Integration Toolkit for Duplicate Detection in Industrial Scenarios

Cited by: 1
Authors
Gagliardelli, Luca [1]
Zhu, Song [1]
Simonini, Giovanni [1]
Bergamaschi, Sonia [1]
Affiliations
[1] Univ Modena & Reggio Emilia, Modena, Italy
Source
TRANSDISCIPLINARY ENGINEERING METHODS FOR SOCIAL INNOVATION OF INDUSTRY 4.0 | 2018 / Vol. 7
Keywords
Duplicate detection; Entity Resolution; Data Integration; Record Linkage; Big Data; Meta-blocking
DOI
10.3233/978-1-61499-898-3-1015
Chinese Library Classification number
T [Industrial Technology]
Discipline classification code
08
Abstract
Duplicate detection aims to identify different records in data sources that refer to the same real-world entity. It is a fundamental task for item catalog fusion, customer database integration, fraud detection, and more. In this work we present BigDedup, a toolkit that detects duplicate records in Big Data sources efficiently. BigDedup makes state-of-the-art duplicate detection techniques available on Apache Spark, a modern framework for distributed computing in Big Data scenarios. It can be used in two ways: (i) through a simple graphical interface that lets the user process structured and unstructured data quickly and effectively; (ii) as a library whose components can be easily extended and customized. In the paper we show how to use BigDedup and demonstrate its usefulness through industrial examples.
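The abstract does not describe BigDedup's API, so the following is only a generic, single-machine sketch of the task it names: finding records that refer to the same entity even when their schemas differ. It assumes schema-agnostic token blocking to generate candidate pairs and a Jaccard threshold to confirm them; the function names, the sample records, and the 0.5 threshold are all hypothetical and are not taken from the paper or from Apache Spark.

```python
from collections import defaultdict
from itertools import combinations


def tokenize(record):
    """Lowercase every attribute value and split it into a set of tokens."""
    tokens = set()
    for value in record.values():
        tokens.update(str(value).lower().split())
    return tokens


def token_blocking(records):
    """Build one block per token, holding the ids of records that contain it."""
    blocks = defaultdict(set)
    for rid, record in records.items():
        for token in tokenize(record):
            blocks[token].add(rid)
    return blocks


def candidate_pairs(blocks):
    """Collect every pair of records that co-occurs in at least one block."""
    pairs = set()
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_duplicates(records, threshold=0.5):
    """Return candidate pairs whose token sets reach the similarity threshold."""
    token_sets = {rid: tokenize(r) for rid, r in records.items()}
    pairs = candidate_pairs(token_blocking(records))
    return sorted(
        (a, b) for a, b in pairs
        if jaccard(token_sets[a], token_sets[b]) >= threshold
    )


# Hypothetical product records with differing attribute names.
records = {
    1: {"name": "Apple iPhone 7", "color": "black"},
    2: {"title": "iphone 7 apple", "colour": "Black"},
    3: {"name": "Samsung Galaxy S8", "color": "black"},
}
duplicates = find_duplicates(records)
print(duplicates)
```

Blocking only restricts comparisons to records that share a token; the similarity check then decides which candidates count as duplicates. A distributed toolkit would partition the blocks across workers rather than loop over them in one process.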
Pages: 1015 - 1023
Page count: 9