Context-Aware Duplicate Detection in Semi-structured Data Streams

Cited by: 1
Authors
Shukla, Parijat [1 ]
Somani, Arun K. [1 ]
Affiliations
[1] Iowa State Univ, Dept Elect & Comp Engn, Ames, IA 50011 USA
Source
2014 IEEE WORLD CONGRESS ON SERVICES (SERVICES) | 2014
Keywords
data streams; duplicate detection; semi-structured data; novel architectures; GPUs; data shaping; XML;
DOI
10.1109/SERVICES.2014.46
Chinese Library Classification
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
The state of the art in duplicate detection in semi-structured data obtains significant improvement by exploiting schema-related knowledge. Such schema-bound duplicate detection approaches, however, have severe limitations when dealing with multi-sourced, heterogeneous, high-velocity data streams. In this paper, we propose a novel context-aware duplicate detection system that is workload- and complexity-aware and is adaptable to the underlying computing platform. The system operates in a schema-oblivious manner and relies on an information-theory-based heuristic and a data shaping technique for efficient and scalable duplicate detection in multi-sourced, heterogeneous data sets. Experiments with real-world data sets show a speedup of up to 8X over state-of-the-art schemes while maintaining up to 92 percent accuracy. In addition, our data shaping technique for GPGPU processing speeds up duplicate detection throughput by up to two orders of magnitude.
Pages: 216-223
Number of pages: 8
Related papers
20 in total
  • [1] [Anonymous], 2002, APPL DEV TRENDS
  • [2] [Anonymous], 2006, IEEE Data Eng. Bull
  • [3] [Anonymous], P ACM SIGMOD INT C M
  • [4] [Anonymous], 1999, MODERN INFORM RETRIE
  • [5] [Anonymous], 2008, INTRO INFORM RETRIEV, DOI 10.1017/CBO9780511809071
  • [6] Approximate joins for data-centric XML
    Augsten, Nikolaus
    Boehlen, Michael
    Dyreson, Curtis
    Gamper, Johann
    [J]. 2008 IEEE 24TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING, VOLS 1-3, 2008, : 814 - +
  • [7] A survey on tree edit distance and related problems
    Bille, P
    [J]. THEORETICAL COMPUTER SCIENCE, 2005, 337 (1-3) : 217 - 239
  • [8] Feekin Amy., 2000, Proc. of the ACM symp. on Applied computing, V1, P323
  • [9] Guha S., 2002, P ACM SIGMOD INT C M, P287, DOI 10.1145/564691.564725
  • [10] Hernandez M. A., 1995, SIGMOD Record, V24, P127, DOI 10.1145/568271.223807