Web-based Arabic/English Duplicate Record Detection with Nested Blocking Technique

Cited by: 0
Authors
Higazy, Azza [1 ]
El Tobely, Tarek [1 ]
Yousef, Ahmed H. [2 ]
Sarhan, Amany [1 ]
Affiliations
[1] Tanta Univ, Comp & Control Dept, Tanta, Egypt
[2] Ain Shams Univ, Dept Comp & Control, Cairo, Egypt
Source
2013 8TH INTERNATIONAL CONFERENCE ON COMPUTER ENGINEERING & SYSTEMS (ICCES) | 2013
Keywords
duplicate record detection; matching data cleaning; indexing; data integration; entity matching; LINKAGE;
DOI
Not available
Chinese Library Classification
TP3 [Computing technology, computer technology];
Discipline classification code
0812 ;
Abstract
Data accuracy and quality affect the success of any business intelligence or data mining solution. The first step in ensuring data accuracy is to make sure that each real-world object is represented once and only once in a given dataset. This becomes more complicated when entities are identified by a string value, as in the case of person names. Such inaccuracies arise from misspellings and a wide range of typographical variations, especially in non-Latin languages such as Arabic. To the best of the authors' knowledge, previously proposed duplicate record detection (DRD) algorithms and frameworks do not support the Arabic language and pose some configuration difficulties. In this paper, an English/Arabic-enabled web-based framework is designed and implemented, taking into account the wide range of variations in the Arabic language. Improved indexing/blocking techniques are used to allow fast processing. The framework is verified through several case studies. The results show that the framework offers substantial improvements over known techniques.
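For readers unfamiliar with blocking, the following is a minimal Python sketch of standard (single-level) blocking for duplicate record detection on person names. It is not the paper's nested blocking technique, whose details the abstract does not give. The normalization rules, the two-character blocking key, the similarity measure, the 0.85 threshold, and all function names are assumptions chosen for illustration only.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Assumed normalization rules (not taken from the paper):
# strip Arabic diacritics (U+064B..U+0652) and tatweel (U+0640).
_DIACRITICS = re.compile("[\u064B-\u0652\u0640]")
# Fold common Arabic letter variants onto one form.
_LETTER_MAP = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda -> bare alef
    "\u0649": "\u064A",  # alef maqsura -> yeh
    "\u0629": "\u0647",  # teh marbuta -> heh
})


def normalize(name):
    """Lowercase, strip diacritics, fold letter variants, collapse spaces."""
    name = _DIACRITICS.sub("", name.strip().lower())
    name = name.translate(_LETTER_MAP)
    return " ".join(name.split())


def blocking_key(name):
    """Hypothetical blocking key: first two characters of the normalized name."""
    return normalize(name)[:2]


def candidate_pairs(records):
    """Yield record-id pairs that share a block, instead of all n*(n-1)/2 pairs."""
    blocks = defaultdict(list)
    for rec_id, name in records.items():
        blocks[blocking_key(name)].append(rec_id)
    for ids in blocks.values():
        yield from combinations(sorted(ids), 2)


def find_duplicates(records, threshold=0.85):
    """Compare only candidate pairs; keep those whose similarity meets the threshold."""
    matches = []
    for a, b in candidate_pairs(records):
        score = SequenceMatcher(None, normalize(records[a]), normalize(records[b])).ratio()
        if score >= threshold:
            matches.append((a, b))
    return matches
```

The point of the blocking step is that string comparisons run only inside each block, which is what allows fast processing on large datasets. The trade-off is that two records whose names differ in the first two characters never get compared, a known weakness of single-key blocking.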
Pages: 313-318
Page count: 6