Efficient processing of exact top-k queries over disk-resident sorted lists

被引:18
作者
Pang, HweeHwa [1 ]
Ding, Xuhua [1 ]
Zheng, Baihua [1 ]
机构
[1] Singapore Management Univ, Sch Informat Syst, Singapore, Singapore
关键词
Top-k query processing; Threshold algorithm; Bloom filter; WEB; AGGREGATION; ALGORITHMS; SYSTEMS;
D O I
10.1007/s00778-009-0174-x
中图分类号
TP3 [计算技术、计算机技术];
学科分类号
0812 ;
摘要
The top-k query is employed in a wide range of applications to generate a ranked list of data that have the highest aggregate scores over certain attributes. As the pool of attributes for selection by individual queries may be large, the data are indexed with per-attribute sorted lists, and a threshold algorithm (TA) is applied on the lists involved in each query. The TA executes in two phases-find a cut-off threshold for the top-k result scores, then evaluate all the records that could score above the threshold. In this paper, we focus on exact top-k queries that involve monotonic linear scoring functions over disk-resident sorted lists. We introduce a model for estimating the depths to which each sorted list needs to be processed in the two phases, so that (most of) the required records can be fetched efficiently through sequential or batched I/Os. We also devise a mechanism to quickly rank the data that qualify for the query answer and to eliminate those that do not, in order to reduce the computation demand of the query processor. Extensive experiments with four different datasets confirm that our schemes achieve substantial performance speed-up of between two times and two orders of magnitude over existing TAs, at the expense of a memory overhead of 4.8 bits per attribute value. Moreover, our scheme is robust to different data distributions and query characteristics.
引用
收藏
页码:437 / 456
页数:20
相关论文
共 51 条
[1]   Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions [J].
Adomavicius, G ;
Tuzhilin, A .
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2005, 17 (06) :734-749
[2]  
Akbarinia Reza, 2007, INT C VERY LARGE DAT, P495
[3]  
[Anonymous], TREC - Text Retrieval Conference
[4]   Anytime measures for top-k algorithms on exact and fuzzy data sets [J].
Arai, Benjamin ;
Das, Gautam ;
Gunopulos, Dimitrios ;
Koudas, Nick .
VLDB JOURNAL, 2009, 18 (02) :407-427
[5]  
Baeza-Yates R, 1999, MODERN INFORM RETRIE, V463
[6]  
Bast H., 2006, P 32 INT C VERY LARG, P475
[7]   SPACE/TIME TRADE/OFFS IN HASH CODING WITH ALLOWABLE ERRORS [J].
BLOOM, BH .
COMMUNICATIONS OF THE ACM, 1970, 13 (07) :422-&
[8]   The anatomy of a large-scale hypertextual Web search engine [J].
Brin, S ;
Page, L .
COMPUTER NETWORKS AND ISDN SYSTEMS, 1998, 30 (1-7) :107-117
[9]   The threshold algorithm: From middleware systems to the relational engine [J].
Bruno, Nicolas ;
Wang, Hui .
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2007, 19 (04) :523-537
[10]  
Chang KevinChen-Chuan., 2002, P ACM INT C MANAGEME, P346