Large-Scale Similarity Join with Edit-Distance Constraints

Cited by: 0
Authors
Lin, Chen [1 ,2 ]
Yu, Haiyang [1 ]
Weng, Wei [3 ]
He, Xianmang [4 ]
Affiliations
[1] Xiamen Univ, Sch Informat Sci & Technol, Xiamen 361005, Peoples R China
[2] Xiamen Univ, Shenzhen Res Inst, Shenzhen 518057, Peoples R China
[3] Xiamen Univ Technol, Sch Comp & Informat Engn, Xiamen 361024, Peoples R China
[4] Ningbo Univ, Sch Informat & Technol, Ningbo 315122, Peoples R China
Source
DATABASE SYSTEMS FOR ADVANCED APPLICATIONS, DASFAA 2014, PT II | 2014 / Vol. 8422
Keywords
Similarity join; big data; Map Reduce; data cleaning;
DOI
Not available
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In the age of big data, the data quality problem is more severe than ever. As an essential step in data cleaning, similarity join has attracted considerable attention from the database community. In this work, to address the similarity join problem with edit-distance constraints, we first improve the partition-based join algorithm for small-scale data. We then extend the algorithm to the MapReduce framework for large-scale data. Extensive experiments on both real and simulated datasets demonstrate the efficiency of our algorithms.
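As a rough illustration of what a partition-based similarity join under an edit-distance threshold can look like, the sketch below uses the well-known pigeonhole filter: if a string is split into tau + 1 segments and another string lies within edit distance tau of it, that other string must contain at least one of the segments as a substring. This is a minimal single-machine sketch, not the improved algorithm or the MapReduce extension from the paper, whose details the abstract does not give. The names `edit_distance`, `segments`, `similarity_join`, and `tau` are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


def segments(s, tau):
    """Split s into tau + 1 contiguous segments of near-equal length."""
    k = tau + 1
    base, extra = divmod(len(s), k)
    out, pos = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        out.append(s[pos:pos + size])
        pos += size
    return out


def similarity_join(strings, tau):
    """Return sorted pairs (a, b), a < b, with edit_distance(a, b) <= tau.

    Assumed illustrative scheme: index every segment of every string,
    probe with all substrings to collect candidates, then verify.
    """
    uniq = sorted(set(strings))
    index = defaultdict(set)  # segment -> ids of strings owning that segment
    for sid, s in enumerate(uniq):
        for seg in segments(s, tau):
            index[seg].add(sid)

    results = set()
    for pid, p in enumerate(uniq):
        # All substrings of p, including the empty one (short strings can
        # produce empty segments, which match trivially).
        subs = {p[i:j] for i in range(len(p) + 1) for j in range(i, len(p) + 1)}
        candidates = set()
        for sub in subs:
            candidates |= index.get(sub, set())
        for sid in candidates:
            if sid >= pid:
                continue  # report each unordered pair once
            s = uniq[sid]
            if abs(len(s) - len(p)) > tau:
                continue  # length filter
            if edit_distance(s, p) <= tau:  # verification step
                results.add((s, p))
    return sorted(results)


pairs = similarity_join(["kitten", "sitten", "sitting", "mitten", "apple"], 1)
```

The filter only prunes candidates; every surviving pair is still verified with the full edit-distance computation, so the sketch does not return false positives. Probing with every substring is deliberately naive and would need the position and length restrictions used by practical partition-based joins to scale.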
Pages: 328-342
Page count: 15