Collection selection for managed distributed document databases

Cited by: 10
Authors
D'Souza, D [1]
Thom, JA [1]
Zobel, J [1]
Affiliation
[1] RMIT Univ, Sch Comp Sci & Informat Technol, Melbourne, Vic 3001, Australia
Funding
Australian Research Council
Keywords
distributed document database; collection selection; meta-indexing; CORI
DOI
10.1016/S0306-4573(03)00008-6
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In a distributed document database system, a query is processed by passing it to a set of individual collections and collating the responses. For a system with many such collections, it is attractive to first identify a small subset of collections as likely to hold documents of interest before interrogating only this small subset in more detail. A method for choosing collections that has been widely investigated is the use of a selection index, which captures broad information about each collection and its documents. In this paper, we re-evaluate several techniques for collection selection. We have constructed new sets of test data that reflect one way in which distributed collections would be used in practice, in contrast to the more artificial division into collections reported in much previous work. Using these managed collections, collection ranking based on document surrogates is more effective than techniques such as CORI that are based on collection lexicons. Moreover, these experiments demonstrate that conclusions drawn from artificial collections are of questionable validity. (C) 2003 Elsevier Ltd. All rights reserved.
Pages: 527-546
Page count: 20