Large-Scale Similarity Join with Edit-Distance Constraints

Cited by: 0
Authors
Lin, Chen [1, 2]
Yu, Haiyang [1 ]
Weng, Wei [3 ]
He, Xianmang [4 ]
Affiliations
[1] Xiamen Univ, Sch Informat Sci & Technol, Xiamen 361005, Peoples R China
[2] Xiamen Univ, Shenzhen Res Inst, Shenzhen 518057, Peoples R China
[3] Xiamen Univ Technol, Sch Comp & Informat Engn, Xiamen 361024, Peoples R China
[4] Ningbo Univ, Sch Informat & Technol, Ningbo 315122, Peoples R China
Source
DATABASE SYSTEMS FOR ADVANCED APPLICATIONS, DASFAA 2014, PT II | 2014, Vol. 8422
Keywords
Similarity join; big data; MapReduce; data cleaning
DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In the age of big data, the data quality problem is more severe than ever. As an essential step in data cleaning, similarity join has attracted considerable attention from the database community. In this work, to address the similarity join problem with edit-distance constraints, we first improve the partition-based join algorithm for small-scale data. We then extend the algorithm to large-scale data using the MapReduce framework. Extensive experiments on both real and simulated datasets demonstrate the efficiency of our algorithms.
Pages: 328 - 342
Page count: 15
Related papers
50 in total
  • [21] Energy-efficient mapping of large-scale workflows under deadline constraints in big data computing systems
    Shu, Tong
    Wu, Chase Q.
    FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2020, 110 : 515 - 530
  • [22] Large-Scale Deep Belief Nets With MapReduce
    Zhang, Kunlei
    Chen, Xue-Wen
    IEEE ACCESS, 2014, 2 : 395 - 403
  • [23] Distributed Similarity Join Over Data Streams Based on Earth Mover's Distance
    Xu J.
    Song C.
    Lv P.
    Li T.-S.
    Jisuanji Xuebao/Chinese Journal of Computers, 2019, 42 (08) : 1779 - 1796
  • [24] HEGJoin: Heterogeneous CPU-GPU Epsilon Grids for Accelerated Distance Similarity Join
    Gallet, Benoit
    Gowanlock, Michael
    DATABASE SYSTEMS FOR ADVANCED APPLICATIONS (DASFAA 2020), PT III, 2020, 12114 : 372 - 388
  • [25] Large-Scale Intelligent Microservices
    Hamilton, Mark
    Gonsalves, Nick
    Lee, Christina
    Raman, Anand
    Walsh, Brendan
    Prasad, Siddhartha
    Banda, Dalitso
    Zhang, Lucy
    Zhang, Lei
    Freeman, William T.
    2020 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2020, : 298 - 309
  • [26] Personalized recommendation based on large-scale implicit feedback
    Yin, Jian, 1953, Chinese Academy of Sciences (25): : 1953 - 1966
  • [27] The Family of MapReduce and Large-Scale Data Processing Systems
    Sakr, Sherif
    Liu, Anna
    Fayoumi, Ayman G.
    ACM COMPUTING SURVEYS, 2013, 46 (01)
  • [28] A survey of large-scale analytical query processing in MapReduce
    Doulkeridis, Christos
    Norvag, Kjetil
    VLDB JOURNAL, 2014, 23 (03) : 355 - 380
  • [29] A survey of large-scale analytical query processing in MapReduce
    Christos Doulkeridis
    Kjetil Nørvåg
    The VLDB Journal, 2014, 23 : 355 - 380
  • [30] Mining large-scale repetitive sequences in a MapReduce setting
    Cao, Hongfei
    Phinney, Michael
    Petersohn, Devin
    Merideth, Benjamin
    Shyu, Chi-Ren
    INTERNATIONAL JOURNAL OF DATA MINING AND BIOINFORMATICS, 2016, 14 (03) : 210 - 228