BigDedup: A Big Data Integration Toolkit for Duplicate Detection in Industrial Scenarios

Cited by: 1
Authors
Gagliardelli, Luca [1]
Zhu, Song [1]
Simonini, Giovanni [1]
Bergamaschi, Sonia [1]
Affiliations
[1] Univ Modena & Reggio Emilia, Modena, Italy
Source
TRANSDISCIPLINARY ENGINEERING METHODS FOR SOCIAL INNOVATION OF INDUSTRY 4.0 | 2018 / Vol. 7
Keywords
Duplicate detection; Entity Resolution; Data Integration; Record Linkage; Big Data; Meta-Blocking
DOI
10.3233/978-1-61499-898-3-1015
Chinese Library Classification (CLC) number
T [Industrial Technology]
Discipline classification code
08
Abstract
Duplicate detection aims to identify different records in data sources that refer to the same real-world entity. It is a fundamental task for item catalog fusion, customer database integration, fraud detection, and more. In this work we present BigDedup, a toolkit that detects duplicate records in Big Data sources efficiently. BigDedup makes state-of-the-art duplicate detection techniques available on Apache Spark, a modern framework for distributed computing in Big Data scenarios. It can be used in two ways: (i) through a simple graphical interface that lets the user process structured and unstructured data quickly and effectively; (ii) as a library whose components can be easily extended and customized. In the paper we show how to use BigDedup and demonstrate its usefulness through industrial examples.
Pages: 1015-1023
Number of pages: 9