Approximate processing of massive continuous quantile queries over high-speed data streams

被引:8
作者
Lin, XM [1 ]
Xu, J
Zhang, Q
Lu, HJ
Yu, JX
Zhou, XF
Yuan, YD
机构
[1] NICTA, Sydney, NSW 2052, Australia
[2] Univ New S Wales, Sch Comp Sci & Engn, Sydney, NSW 2052, Australia
[3] Univ British Columbia, Dept Comp Sci, Vancouver, BC V6T 1Z4, Canada
[4] CSIRO, ICT Ctr, E Hlth Res Ctr, Brisbane, Qld 4000, Australia
[5] Hong Kong Univ Sci & Technol, Dept Comp Sci, Hong Kong, Hong Kong, Peoples R China
[6] Chinese Univ Hong Kong, Dept Syst Engn & Engn Management, Shatin, Hong Kong, Peoples R China
[7] Univ Queensland, Sch Informat Technol & Elect Engn, Brisbane, Qld 4072, Australia
基金
澳大利亚研究理事会;
关键词
query processing; online computation; data mining;
D O I
10.1109/TKDE.2006.73
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Quantile computation has many applications including data mining and financial data analysis. It has been shown that an is an element of-approximate summary can be maintained so that, given a quantile query d (phi, is an element of), the data item at rank [phi N] may be approximately obtained within the rank error precision is an element of N over all N data items in a data stream or in a sliding window. However, scalable online processing of massive continuous quantile queries with different phi and is an element of poses a new challenge because the summary is continuously updated with new arrivals of data items. In this paper, first we aim to dramatically reduce the number of distinct query results by grouping a set of different queries into a cluster so that they can be processed virtually as a single query while the precision requirements from users can be retained. Second, we aim to minimize the total query processing costs. Efficient algorithms are developed to minimize the total number of times for reprocessing clusters and to produce the minimum number of clusters, respectively. The techniques are extended to maintain near-optimal clustering when queries are registered and removed in an arbitrary fashion against whole data streams or sliding windows. In addition to theoretical analysis, our performance study indicates that the proposed techniques are indeed scalable with respect to the number of input queries as well as the number of items and the item arrival rate in a data stream.
引用
收藏
页码:683 / 698
页数:16
相关论文
共 36 条
[1]  
[Anonymous], P 2002 INT C VER LAR
[2]  
[Anonymous], P ACM SIGMOD C MAN D
[3]  
*APT SYST INC, 2003, VERS 2 0 FIN AN PACK
[4]  
Arasu A., 2004, P 23 ACM SIGMOD SIGA, P286, DOI [DOI 10.1145/1055558.1055598, DOI 10.1145/1055558.1055598>]
[5]  
Babcock B., 2002, Proceedings of the Twenty-First ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), P1, DOI DOI 10.1145/543613.543615
[6]  
Chen J., 2000, SIGMOD 00, P379, DOI DOI 10.1145/342009.335432
[7]  
Cormen T. H., 2001, Introduction to Algorithms, V2nd
[8]  
Cormode G, 2005, PROC INT CONF DATA, P20
[9]  
Cormode G., 2004, J ALGORITHMS, V55
[10]  
Cormode Graham, 2003, P VLDB, P464