SEMI: A Scalable Entity Matching System Based on MapReduce

Cited by: 0
Authors
Chao, Pingfu [1, 2]
Li, Yuming [1, 2]
Gao, Zhu [2]
Fang, Junhua [1, 2]
He, Xiaofeng [1, 2]
Zhang, Rong [1, 2]
Affiliations
[1] E China Normal Univ, Inst Data Sci & Engn, Shanghai 200062, Peoples R China
[2] E China Normal Univ, Shanghai Key Lab Trustworthy Comp, Shanghai 200062, Peoples R China
Source
DATABASES THEORY AND APPLICATIONS | 2015 / Vol. 9093
Keywords
DOI
10.1007/978-3-319-19548-3_29
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
The MapReduce framework provides a new platform for data integration in distributed environments. We demonstrate a MapReduce-based entity resolution framework that efficiently solves the matching problem for structured, semi-structured, and unstructured entities. We propose a random-based data representation method to reduce network transmission, implement our design on MapReduce, and design two solutions for reducing redundant comparisons. Our demo provides an easy-to-use platform for entity matching and performance analysis. We also compare the performance of our algorithm with state-of-the-art blocking-based methods.
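For readers unfamiliar with the blocking-based baseline mentioned in the abstract, the following is a minimal single-process sketch of blocking-based entity matching in a map/shuffle/reduce style. It is a generic illustration, not SEMI's design: the token-based blocking key, the Jaccard similarity, the 0.5 threshold, the record layout, and all function names are assumptions made for this example. The rule of comparing a pair only in its smallest shared block key is one common way to avoid redundant comparisons; the abstract does not say whether either of the paper's two solutions works this way.

```python
from collections import defaultdict
from itertools import combinations


def tokens(record):
    """Lowercased token set of the (hypothetical) 'name' field."""
    return set(record["name"].lower().split())


def map_phase(records):
    """Emit (block key, record) pairs; here every token is a block key."""
    for record in records:
        for key in tokens(record):
            yield key, record


def shuffle(pairs):
    """Group records by block key, standing in for the MapReduce shuffle."""
    groups = defaultdict(list)
    for key, record in pairs:
        groups[key].append(record)
    return groups


def jaccard(a, b):
    """Jaccard similarity of the two records' token sets."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)


def reduce_phase(groups, threshold=0.5):
    """Compare pairs inside each block, skipping redundant comparisons."""
    matches, comparisons = [], 0
    for key, block in groups.items():
        for a, b in combinations(block, 2):
            # A pair sharing several keys meets in several blocks;
            # compare it only in the smallest shared key.
            if key != min(tokens(a) & tokens(b)):
                continue
            comparisons += 1
            if jaccard(a, b) >= threshold:
                matches.append(tuple(sorted((a["id"], b["id"]))))
    return sorted(matches), comparisons


# Made-up sample records for illustration only.
records = [
    {"id": 1, "name": "Apple iPhone 6"},
    {"id": 2, "name": "apple iphone 6 plus"},
    {"id": 3, "name": "Samsung Galaxy S6"},
    {"id": 4, "name": "Galaxy S6 Samsung"},
]
matches, comparisons = reduce_phase(shuffle(map_phase(records)))
```

In this sample each matching pair shares three block keys, so a naive reducer would compare it three times; the smallest-shared-key check reduces that to one comparison per pair.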
Pages: 328-332
Page count: 5