Two-tiered correlation clustering method for entity resolution in big data

Cited by: 0

Authors:

Affiliation:

[1] School of Computer and Information Technology, Beijing Jiaotong University, Beijing

Source:

Wang, Ning | Year: 1600 / Volume: Science Press / Issue: 51

Keywords:

Big data; Common neighborhood; Correlation clustering; Data integration; Entity resolution; Noisy data

DOI:

10.7544/issn1000-1239.2014.20131345

Chinese Library Classification number:

Subject classification number:

Abstract:

Volume, velocity, variety, and veracity are four defining features of big data, and they bring new challenges to data integration. Entity resolution is one of the most important steps in data integration. On big data, conventional entity resolution methods tend to be inefficient and ineffective in practice, especially with respect to noise immunity. To address the inconsistency in resolution results caused by these four features, we introduce the concept of common neighborhood into the correlation clustering problem. The top tier, for pre-partition, is designed around the neighborhood and completes the preliminary partition of blocks quickly and effectively. Introducing the concept of a kernel gives a more precise definition of the correlation degree between a node and a cluster. As a consequence, the bottom tier, for adjustment, can cluster nodes accurately and improve the accuracy of the correlation clustering. The two-tiered method is simple and efficient because it needs only a coarse similarity function, and the use of the neighborhood gives it good noise immunity. Extensive experiments demonstrate that the proposed two-tiered method achieves high accuracy and good noise immunity compared with traditional methods, and that it is scalable to big data.
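The two tiers described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the formal definitions of common neighborhood, kernel, or correlation degree, so everything here is an assumption made for illustration: the Jaccard overlap of closed neighborhoods, the greedy pivoting in `pre_partition`, the majority-linked `kernel`, the thresholds `tau` and `ratio`, and the toy `share_token` similarity are all hypothetical and are not taken from the paper.

```python
from itertools import combinations


def share_token(a, b):
    """Toy coarse similarity (assumption): two records share a word."""
    return bool(set(a.lower().split()) & set(b.lower().split()))


def build_graph(records, similar):
    """Link every pair of records that the coarse similarity test accepts."""
    neighbors = {i: {i} for i in range(len(records))}  # closed neighborhoods
    for i, j in combinations(range(len(records)), 2):
        if similar(records[i], records[j]):
            neighbors[i].add(j)
            neighbors[j].add(i)
    return neighbors


def common_neighborhood(neighbors, u, v):
    """Assumed measure: Jaccard overlap of the two closed neighborhoods."""
    return len(neighbors[u] & neighbors[v]) / len(neighbors[u] | neighbors[v])


def pre_partition(neighbors, tau=0.5):
    """Top tier (assumed form): greedy pivoting on neighborhood overlap."""
    unassigned = set(neighbors)
    clusters = []
    for pivot in sorted(neighbors):
        if pivot not in unassigned:
            continue
        # The pivot always joins its own cluster (overlap with itself is 1.0).
        cluster = {v for v in neighbors[pivot] & unassigned
                   if common_neighborhood(neighbors, pivot, v) >= tau}
        clusters.append(cluster)
        unassigned -= cluster
    return clusters


def kernel(neighbors, cluster, ratio=0.5):
    """Assumed kernel: members linked to at least `ratio` of their cluster."""
    core = {v for v in cluster
            if len(neighbors[v] & cluster) >= ratio * len(cluster)}
    return core or set(cluster)


def correlation_degree(neighbors, node, core):
    """Assumed degree: fraction of the kernel (minus the node) it links to."""
    others = core - {node}
    if not others:
        return 0.0
    return len(neighbors[node] & others) / len(others)


def adjust(neighbors, clusters, ratio=0.5):
    """Bottom tier (assumed form): move each node to its best-matching kernel."""
    cores = [kernel(neighbors, c, ratio) for c in clusters]
    result = [set() for _ in clusters]
    for node in sorted(neighbors):
        current = next(i for i, c in enumerate(clusters) if node in c)
        scores = [correlation_degree(neighbors, node, core) for core in cores]
        # Ties are broken in favour of the node's current cluster.
        best = max(range(len(clusters)), key=lambda i: (scores[i], i == current))
        result[best].add(node)
    return [c for c in result if c]


def resolve(records, similar, tau=0.5, ratio=0.5):
    """Run both tiers: pre-partition, then adjustment."""
    neighbors = build_graph(records, similar)
    return adjust(neighbors, pre_partition(neighbors, tau), ratio)


records = ["john smith", "jon smith", "john smyth", "mary jones", "mary jone"]
print([sorted(c) for c in resolve(records, share_token)])  # [[0, 1, 2], [3, 4]]
```

In this toy run, "jon smith" and "john smyth" share no word with each other, yet both land in the same cluster as "john smith" because each shares most of its neighborhood with it. That is the kind of tolerance to a noisy pairwise similarity that the abstract attributes to the neighborhood. The all-pairs comparison in `build_graph` is quadratic and is kept only for brevity.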

Pages: 2108-2116

Page count: 8

