Web-based Arabic/English Duplicate Record Detection with Nested Blocking Technique

Cited by: 0
Authors
Higazy, Azza [1]
El Tobely, Tarek [1]
Yousef, Ahmed H. [2]
Sarhan, Amany [1]
Affiliations
[1] Tanta Univ, Comp & Control Dept, Tanta, Egypt
[2] Ain Shams Univ, Dept Comp & Control, Cairo, Egypt
Source
2013 8TH INTERNATIONAL CONFERENCE ON COMPUTER ENGINEERING & SYSTEMS (ICCES) | 2013
Keywords
duplicate record detection; matching data cleaning; indexing; data integration; entity matching; LINKAGE
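DOI
Not available
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract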
Data accuracy and quality affect the success of any business intelligence or data mining solution. The first step in ensuring data accuracy is to make sure that each real-world object is represented once and only once in a given dataset. This becomes more complicated when entities are identified by a string value, as is the case with person names. Such data inaccuracies arise from misspellings and a wide range of typographical variations, especially in non-Latin languages such as Arabic. To the best of the authors' knowledge, previously proposed duplicate record detection (DRD) algorithms and frameworks do not support the Arabic language and are difficult to configure. In this paper, an English/Arabic-enabled web-based framework is designed and implemented, taking into account the wide range of variations in the Arabic language. Improved indexing/blocking techniques are used to allow fast processing. The framework is implemented and verified through several case studies. Results showed that the framework offers substantial improvements over known techniques.
Pages: 313-318
Number of pages: 6