Large-Scale Similarity Join with Edit-Distance Constraints

Cited by: 0
Authors
Lin, Chen [1 ,2 ]
Yu, Haiyang [1 ]
Weng, Wei [3 ]
He, Xianmang [4 ]
Affiliations
[1] Xiamen Univ, Sch Informat Sci & Technol, Xiamen 361005, Peoples R China
[2] Xiamen Univ, Shenzhen Res Inst, Shenzhen 518057, Peoples R China
[3] Xiamen Univ Technol, Sch Comp & Informat Engn, Xiamen 361024, Peoples R China
[4] Ningbo Univ, Sch Informat & Technol, Ningbo 315122, Peoples R China
Source
DATABASE SYSTEMS FOR ADVANCED APPLICATIONS, DASFAA 2014, PT II | 2014 / Vol. 8422
Keywords
Similarity join; big data; Map Reduce; data cleaning
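DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract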
In the age of big data, the data quality problem is more severe than ever. As an essential step in data cleaning, similarity join has attracted considerable attention from the database community. In this work, to address the similarity join problem with edit-distance constraints, we first improve the partition-based join algorithm for small-scale data. We then extend the algorithm to the MapReduce framework for large-scale data. Extensive experiments on both real and simulated datasets demonstrate the efficiency of our algorithms.
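For readers unfamiliar with the problem the abstract names, the following is a minimal sketch of a similarity self-join under an edit-distance threshold. It is a brute-force baseline with a length filter, not the partition-based or MapReduce algorithm from the paper; the function names `edit_distance_within` and `similarity_self_join` and the threshold parameter `tau` are assumptions introduced here for illustration.

```python
from itertools import combinations


def edit_distance_within(a: str, b: str, tau: int) -> bool:
    """Return True if the Levenshtein distance between a and b is <= tau."""
    # Length filter: the distance is at least the length difference.
    if abs(len(a) - len(b)) > tau:
        return False
    # Standard dynamic programming, keeping only two rows.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        # Row minima never decrease, so the final distance must exceed tau.
        if min(curr) > tau:
            return False
        prev = curr
    return prev[-1] <= tau


def similarity_self_join(strings, tau):
    """Return all index pairs (i, j), i < j, whose strings are within tau edits."""
    return [(i, j)
            for (i, a), (j, b) in combinations(enumerate(strings), 2)
            if edit_distance_within(a, b, tau)]


if __name__ == "__main__":
    words = ["kitten", "sitten", "sitting", "mitten", "xyz"]
    print(similarity_self_join(words, 1))
```

This baseline verifies every pair, so its cost grows quadratically with the number of strings; that cost is the motivation for filtering and distributed approaches such as those the abstract describes.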
Pages: 328 - 342
Page count: 15
Related papers
50 in total
  • [31] Greedy column subset selection for large-scale data sets
    Farahat, Ahmed K.
    Elgohary, Ahmed
    Ghodsi, Ali
    Kamel, Mohamed S.
    KNOWLEDGE AND INFORMATION SYSTEMS, 2015, 45 (01) : 1 - 34
  • [32] Large-scale multi-label ensemble learning on Spark
    Gonzalez-Lopez, Jorge
    Cano, Alberto
    Ventura, Sebastian
    2017 16TH IEEE INTERNATIONAL CONFERENCE ON TRUST, SECURITY AND PRIVACY IN COMPUTING AND COMMUNICATIONS / 11TH IEEE INTERNATIONAL CONFERENCE ON BIG DATA SCIENCE AND ENGINEERING / 14TH IEEE INTERNATIONAL CONFERENCE ON EMBEDDED SOFTWARE AND SYSTEMS, 2017, : 893 - 900
  • [33] Big R: Large-scale Analytics on Hadoop using R
    Lara, Oscar D.
    Zhuang, Weiqiang
    Pannu, Adarsh
    2014 IEEE INTERNATIONAL CONGRESS ON BIG DATA (BIGDATA CONGRESS), 2014, : 569 - 576
  • [34] PLAR: Parallel Large-scale Attribute Reduction on Cloud Systems
    Zhang, Junbo
    Li, Tianrui
    Pan, Yi
    2013 INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED COMPUTING, APPLICATIONS AND TECHNOLOGIES (PDCAT), 2013, : 184 - 191
  • [35] Asyn-SimRank: An asynchronous large-scale simrank algorithm
    Wang, Chunlei
    Zhang, Yanfeng
    Bao, Yubin
    Zhao, Changkuan
    Yu, Ge
    Gao, Lixin
    Jisuanji Yanjiu yu Fazhan/Computer Research and Development, 2015, 52 (07): : 1567 - 1579
  • [36] Greedy column subset selection for large-scale data sets
    Ahmed K. Farahat
    Ahmed Elgohary
    Ali Ghodsi
    Mohamed S. Kamel
    Knowledge and Information Systems, 2015, 45 : 1 - 34
  • [37] A LARGE-SCALE STUDY OF WORLD MYTHS
    Thuillard, Marc
    Le Quellec, Jean-Loic
    d'huy, Julien
    Berezkin, Yuri
    TRAMES-JOURNAL OF THE HUMANITIES AND SOCIAL SCIENCES, 2018, 22 (04): : A1 - A44
  • [38] Data Provenance in Large-Scale Distribution
    Zhu, Yunan
    Che, Wei
    Shan, Chao
    Zhao, Shen
    ARTIFICIAL INTELLIGENCE AND SECURITY, ICAIS 2022, PT III, 2022, 13340 : 28 - 42
  • [39] Large-scale incremental processing with MapReduce
    Lee, Daewoo
    Kim, Jin-Soo
    Maeng, Seungryoul
    FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2014, 36 : 66 - 79
  • [40] Large-Scale Automated Sleep Staging
    Sun, Haoqi
    Jia, Jian
    Goparaju, Balaji
    Huang, Guang-Bin
    Sourina, Olga
    Bianchi, Matt Travis
    Westover, M. Brandon
    SLEEP, 2017, 40 (10)