BigDedup: A Big Data Integration Toolkit for Duplicate Detection in Industrial Scenarios

Cited by: 1
Authors
Gagliardelli, Luca [1]
Zhu, Song [1]
Simonini, Giovanni [1]
Bergamaschi, Sonia [1]
Affiliations
[1] Univ Modena & Reggio Emilia, Modena, Italy
Source
TRANSDISCIPLINARY ENGINEERING METHODS FOR SOCIAL INNOVATION OF INDUSTRY 4.0 | 2018 / Vol. 7
Keywords
Duplicate detection; Entity Resolution; Data Integration; Record Linkage; Big Data; Meta-Blocking
DOI
10.3233/978-1-61499-898-3-1015
Chinese Library Classification (CLC) number
T [Industrial Technology]
Discipline classification code
08
Abstract
Duplicate detection aims to identify different records in data sources that refer to the same real-world entity. It is a fundamental task for item catalog fusion, customer database integration, fraud detection, and more. In this work we present BigDedup, a toolkit that efficiently detects duplicate records in Big Data sources. BigDedup makes state-of-the-art duplicate detection techniques available on Apache Spark, a modern framework for distributed computing in Big Data scenarios. It can be used in two ways: (i) through a simple graphical interface that lets the user process structured and unstructured data quickly and effectively; (ii) as a library whose components can be easily extended and customized. In the paper we show how to use BigDedup and demonstrate its usefulness through industrial examples.
Pages: 1015 - 1023
Number of pages: 9
Related papers
50 in total
  • [31] The Integration Development and Upgrading Path of Industry 4.0 Architecture Industrial Engineering Network Driven by Big Data
    Li, Hui
    NEW APPROACHES FOR MULTIDIMENSIONAL SIGNAL PROCESSING, NAMSP 2022, 2023, 332 : 217 - 224
  • [32] Data integration from traditional to big data: main features and comparisons of ETL approaches
    Walha, Afef
    Ghozzi, Faiza
    Gargouri, Faiez
    JOURNAL OF SUPERCOMPUTING, 2024, 80 (19) : 26687 - 26725
  • [33] TBtools: An Integrative Toolkit Developed for Interactive Analyses of Big Biological Data
    Chen, Chengjie
    Chen, Hao
    Zhang, Yi
    Thomas, Hannah R.
    Frank, Margaret H.
    He, Yehua
    Xia, Rui
    MOLECULAR PLANT, 2020, 13 (08) : 1194 - 1202
  • [34] MongoDB-Based Modular Ontology Building for Big Data Integration
    Abbes, Hanen
    Gargouri, Faiez
    JOURNAL ON DATA SEMANTICS, 2018, 7 (01) : 1 - 27
  • [35] Anomalies detection for big data
    Torres-Dominguez, Omar
    Sabater-Fernandez, Samuel
    Bravo-Ilisatigui, Lisandra
    Martin-Rodriguez, Diana
    Garcia-Borroto, Milton
    REVISTA FACULTAD DE INGENIERIA, UNIVERSIDAD PEDAGOGICA Y TECNOLOGICA DE COLOMBIA, 2019, 28 (50) : 62 - 75
  • [36] Design of a big data integration system for physical fitness training development
    Yao Y.
    Kang J.
    Journal of Commercial Biotechnology, 2020, 25 (01) : 47 - 56
  • [37] Scaling industrial applications for the Big Data era
    Sutic, Davor
    Varga, Ervin
    COMPUTER SCIENCE AND INFORMATION SYSTEMS, 2022, 19 (01) : 117 - 139
  • [38] Application of the Integration of the Internet of Things and Big Data
    Wang, Yingli
    PROCEEDINGS OF THE 2018 INTERNATIONAL CONFERENCE ON MECHANICAL, ELECTRONIC, CONTROL AND AUTOMATION ENGINEERING (MECAE 2018), 2018, 149 : 249 - 254
  • [39] Intelligent Fault Diagnosis for Industrial Big Data
    Si, Jia
    Li, Yibin
    Ma, Sile
    JOURNAL OF SIGNAL PROCESSING SYSTEMS FOR SIGNAL IMAGE AND VIDEO TECHNOLOGY, 2018, 90 (8-9) : 1221 - 1233
  • [40] Intelligent Fault Diagnosis for Industrial Big Data
    Jia Si
    Yibin Li
    Sile Ma
    Journal of Signal Processing Systems, 2018, 90 : 1221 - 1233