Large-Scale Similarity Join with Edit-Distance Constraints

Cited by: 0
Authors
Lin, Chen [1, 2]
Yu, Haiyang [1]
Weng, Wei [3]
He, Xianmang [4]
Affiliations
[1] Xiamen Univ, Sch Informat Sci & Technol, Xiamen 361005, Peoples R China
[2] Xiamen Univ, Shenzhen Res Inst, Shenzhen 518057, Peoples R China
[3] Xiamen Univ Technol, Sch Comp & Informat Engn, Xiamen 361024, Peoples R China
[4] Ningbo Univ, Sch Informat & Technol, Ningbo 315122, Peoples R China
Source
DATABASE SYSTEMS FOR ADVANCED APPLICATIONS, DASFAA 2014, PT II | 2014 / Vol. 8422
Keywords
Similarity join; big data; MapReduce; data cleaning
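DOI
Not available
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract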
In the age of big data, the data quality problem is more severe than ever. As an essential step in data cleaning, similarity join has attracted considerable attention from the database community. In this work, to address the similarity join problem with edit-distance constraints, we first improve the partition-based join algorithm for small-scale data. We then extend the algorithm using the MapReduce framework for large-scale data. Extensive experiments on both real and simulated datasets demonstrate the efficiency of our algorithms.
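For readers unfamiliar with the problem setting, the following is a minimal sketch of a similarity self-join under an edit-distance threshold. It is a naive all-pairs baseline with a length filter and an early-exit check, not the partition-based algorithm or the MapReduce extension described in the abstract. The names `edit_distance_within`, `similarity_self_join`, and `tau` are illustrative assumptions.

```python
from itertools import combinations


def edit_distance_within(a: str, b: str, tau: int) -> bool:
    """Return True if the Levenshtein distance between a and b is <= tau."""
    # Length filter: the distance is at least the difference in lengths.
    if abs(len(a) - len(b)) > tau:
        return False
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # match or substitute
        # Row minima never decrease, so the pair can be rejected early.
        if min(curr) > tau:
            return False
        prev = curr
    return prev[-1] <= tau


def similarity_self_join(strings, tau):
    """Return index pairs (i, j), i < j, whose strings are within tau edits.

    Naive all-pairs baseline (hypothetical helper, quadratic in the input size).
    """
    return [(i, j)
            for (i, a), (j, b) in combinations(enumerate(strings), 2)
            if edit_distance_within(a, b, tau)]


if __name__ == "__main__":
    words = ["kitten", "sitten", "sitting", "mitten", "apple"]
    print(similarity_self_join(words, 1))
```

The quadratic number of pair comparisons in this baseline is the cost that filtering and partitioning techniques aim to reduce on large inputs.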
Pages: 328-342
Page count: 15
Related papers
50 in total
  • [1] Distributed Entity Resolution Based on Similarity Join for Large-Scale Data Clustering
    Nie, Tiezheng
    Lee, Wang-chien
    Shen, Derong
    Yu, Ge
    Kou, Yue
    WEB-AGE INFORMATION MANAGEMENT, WAIM 2014, 2014, 8485 : 138 - 149
  • [2] A distributed framework for large-scale semantic trajectory similarity join
    Tian, Ruijie
    Li, Jiajun
    Zhang, Weishi
    Wang, Fei
    MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (06) : 16205 - 16229
  • [3] A distributed framework for large-scale semantic trajectory similarity join
    Ruijie Tian
    Jiajun Li
    Weishi Zhang
    Fei Wang
    Multimedia Tools and Applications, 2024, 83 : 16205 - 16229
  • [4] Efficient large-scale distance-based join queries in SpatialHadoop
    Garcia-Garcia, Francisco
    Corral, Antonio
    Iribarne, Luis
    Vassilakopoulos, Michael
    Manolopoulos, Yannis
    GEOINFORMATICA, 2018, 22 (02) : 171 - 209
  • [5] Efficient large-scale distance-based join queries in SpatialHadoop
    Francisco García-García
    Antonio Corral
    Luis Iribarne
    Michael Vassilakopoulos
    Yannis Manolopoulos
    GeoInformatica, 2018, 22 : 171 - 209
  • [6] EFFICIENT STRING EDIT SIMILARITY JOIN ALGORITHM
    Gouda, Karam
    Rashad, Metwally
    COMPUTING AND INFORMATICS, 2017, 36 (03) : 683 - 704
  • [7] An efficient similarity join approach on large-scale high-dimensional data using random projection
    Ma, Youzhong
    Zhang, Ruiling
    Jia, Shijie
    Zhang, Yongxin
    Meng, Xiaofeng
    CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2019, 31 (20)
  • [8] Large-Scale Spatial Join Query Processing in Cloud
    You, Simin
    Zhang, Jianting
    Gruenwald, Le
    2015 13TH IEEE INTERNATIONAL CONFERENCE ON DATA ENGINEERING WORKSHOPS (ICDEW), 2015 : 34 - 41
  • [9] Large-Scale Text Similarity Computing with Spark
    Bao, Xiaoan
    Dai, Shichao
    Zhang, Na
    Yu, Chenghai
    INTERNATIONAL JOURNAL OF GRID AND DISTRIBUTED COMPUTING, 2016, 9 (04) : 95 - 100
  • [10] A Theoretical and Experimental Comparison of Large-Scale Join Algorithms in Spark
    Phan A.-C.
    Phan T.-C.
    Trieu T.-N.
    Tran T.-T.-Q.
    SN Computer Science, 2021, 2 (5)