XML stream processing using tree-edit, distance embeddings

被引:33
作者
Garofalakis, M
Kumar, A
机构
[1] Bell Labs, Lucent Technol, Murray Hill, NJ 07974 USA
[2] Indian Inst Technol, Dept Comp Sci & Engn, New Delhi 110016, India
来源
ACM TRANSACTIONS ON DATABASE SYSTEMS | 2005年 / 30卷 / 01期
关键词
algorithms; performance; theory; XML; data streams; data synopses; approximate query processing; tree-edit distance; metric-space embeddings;
D O I
10.1145/1061318.1061326
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
We propose the first known solution to the problem of correlating, in small space, continuous streams of XML data through approximate (structure and content) matching, as defined by a general tree-edit distance metric. The key element of our solution is a,novel algorithm for obliviously embedding tree-edit distance metrics into an L-1 vector space while. guaranteeing a.(worst-case) upper bound of O(log(2) n log* n) on the distance distortion between any data trees with at most n n. odes. We demonstrate how our embedding algorithm can be applied in conjunction with known random sketching techniques to (1) build a compact synopsis of a massive, streaming XML data tree that can be used as a concise surrogate for the full tree in approximate tree-edit distance computations; and (2) approximate the result of tree-edit-distance similarity joins over continuous XML document streams. Experimental results from an empirical study with both synthetic and real-life XML data trees validate our approach, demonstrating that the average-case behavior of our embedding techniques is much better than what would be predicted from our theoretical worst-case distortion bounds. To the best of our knowledge, these are the first algorithmic results on low-distortion embeddings for tree-edit distance metrics, and on correlating (e.g., through similarity joins) XML data in the streaming model.
引用
收藏
页码:279 / 332
页数:54
相关论文
共 53 条
[1]  
Acharya S, 1999, SIGMOD RECORD, VOL 28, NO 2 - JUNE 1999, P275, DOI 10.1145/304181.304207
[2]  
Alon N., 1996, Proceedings of the Twenty-Eighth Annual ACM Symposium on the Theory of Computing, P20, DOI 10.1145/237814.237823
[3]  
ALON N, 1999, P 18 ACM SIGACT SIGM
[4]  
Altinel Mehmet., 2000, VLDB, P53
[5]  
[Anonymous], P 21 ACM SIGMOD SIGA
[6]  
[Anonymous], 2003, EXPLORATORY DATA MIN
[7]  
[Anonymous], P 27 INT C VER LARG
[8]  
Apostolico A, 1997, Pattern Matching Algorithms
[9]  
Babcock B., 2002, Proceedings of the Twenty-First ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), P1, DOI DOI 10.1145/543613.543615
[10]  
BARYOSSEF Z, 2002, P 6 INT WORKSH RAND