Two-tiered correlation clustering method for entity resolution in big data

Cited by: 0
Authors
Affiliations
[1] School of Computer and Information Technology, Beijing Jiaotong University, Beijing
Source
Wang, Ning | Year: 1600 / Volume: Science Press / Issue: 51
Keywords
Big data; Common neighborhood; Correlation clustering; Data integration; Entity resolution; Noisy data
DOI
10.7544/issn1000-1239.2014.20131345
Chinese Library Classification (CLC) number
Subject classification code
Abstract
Volume, velocity, variety, and veracity are the four defining features of big data, and they bring new challenges to data integration. Entity resolution is one of the most important steps in data integration. On big data, conventional entity resolution methods tend to be inefficient and ineffective in practice, particularly with respect to noise immunity. To address the inconsistent resolution results caused by these four features, we introduce the concept of common neighborhood into the correlation clustering problem. The top tier, used for pre-partition, is designed around the neighborhood and completes the preliminary partition of blocks quickly and effectively. Introducing the concept of a kernel gives a more precise definition of the correlation degree between a node and a cluster. As a consequence, the bottom tier, used for adjustment, clusters nodes accurately and improves the accuracy of the correlation clustering. Because it relies on a coarse similarity function, the two-tiered method for entity resolution is simple and efficient. Meanwhile, the introduction of the neighborhood gives the method good noise immunity. Extensive experiments demonstrate that the proposed two-tiered method achieves high accuracy and good noise immunity compared with traditional methods, and that it scales to big data.
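The abstract names the two tiers but does not give their definitions, so the following is only a minimal illustrative sketch of the general idea, not the authors' algorithm. Everything specific in it is an assumption made for illustration: the helper names (`neighborhoods`, `pre_partition`, `kernel`, `adjust`), the overlap threshold `tau`, the Jaccard-style common-neighborhood test, the "strict majority" kernel rule, and the `min_degree` cutoff. The input is a set of positive links that a coarse similarity function is assumed to have produced.

```python
def neighborhoods(n, positive_edges):
    """Closed neighborhood of each node: the node itself plus its positive links."""
    nbrs = [{i} for i in range(n)]
    for u, v in positive_edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    return nbrs


def pre_partition(n, positive_edges, tau=0.5):
    """Top tier (assumed rule): merge linked nodes whose neighborhoods overlap enough."""
    nbrs = neighborhoods(n, positive_edges)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in positive_edges:
        # Share of the combined neighborhood that the two nodes have in common.
        overlap = len(nbrs[u] & nbrs[v]) / len(nbrs[u] | nbrs[v])
        if overlap >= tau:
            parent[find(u)] = find(v)

    blocks = {}
    for i in range(n):
        blocks.setdefault(find(i), set()).add(i)
    return list(blocks.values())


def kernel(cluster, nbrs):
    """Assumed kernel: members linked to a strict majority of the cluster (self included)."""
    return {u for u in cluster if 2 * len(nbrs[u] & cluster) > len(cluster)}


def adjust(n, positive_edges, clusters, min_degree=0.5):
    """Bottom tier (assumed rule): one pass assigning each node to its best-matching kernel."""
    nbrs = neighborhoods(n, positive_edges)
    kernels = [kernel(c, nbrs) for c in clusters]
    result = [set() for _ in clusters]
    for u in range(n):
        best, best_degree = None, 0.0
        for idx, k in enumerate(kernels):
            others = k - {u}
            if not others:
                continue
            # Assumed correlation degree: fraction of kernel nodes linked to u.
            degree = len(nbrs[u] & others) / len(others)
            if degree > best_degree:
                best, best_degree = idx, degree
        if best is not None and best_degree >= min_degree:
            result[best].add(u)
        else:
            result.append({u})  # no cluster fits well enough: keep u on its own
    return [c for c in result if c]


# Two true entities {0, 1, 2} and {3, 4, 5}, plus one noisy link between 2 and 3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
blocks = pre_partition(6, edges, tau=0.5)
refined = adjust(6, edges, [{0, 1, 2, 3}, {4, 5}])
```

In this toy input, nodes 2 and 3 share only a small part of their combined neighborhood, so the noisy link alone does not merge the two blocks. The second call starts from a deliberately wrong partition to show the adjustment step moving node 3 toward the kernel it is most strongly linked to.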
Pages: 2108-2116
Page count: 8