A Topic-Based Measure of Resource Description Quality for Distributed Information Retrieval

Cited by: 0
Authors
Baillie, Mark [1]
Carman, Mark J. [2]
Crestani, Fabio [2]
Affiliations
[1] Univ Strathclyde, CIS Dept, Glasgow, Lanark, Scotland
[2] Univ Lugano, Fac Informat, Lugano, Switzerland
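Source
ADVANCES IN INFORMATION RETRIEVAL, PROCEEDINGS | 2009 / Volume 5478
Funding
UK Engineering and Physical Sciences Research Council (EPSRC)
Keywords
DOI
Not available
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract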
The aim of query-based sampling is to obtain a sufficient, representative sample of an underlying (text) collection. Current measures for assessing sample quality are too coarse-grained to be informative. This paper outlines a finer-grained measure based on probabilistic topic models of text. The assumption we make is that a representative sample should capture the broad themes of the underlying text collection. If these themes are not captured, resource selection will suffer in performance, coverage and reliability. For example, resource selection algorithms that extrapolate from a small sample of indexed documents to determine which collections are most likely to hold relevant documents may be affected by samples that do not reflect the topical density of a collection. To address this issue, we propose to measure the relative entropy of the topics obtained in a sample with respect to those of the complete collection. Topics are both modelled from the collection and inferred in the sample using latent Dirichlet allocation. The paper outlines an analysis and evaluation of this methodology across a number of collections and sampling algorithms.
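The relative-entropy comparison described in the abstract can be illustrated with a minimal sketch. This record does not give the paper's exact formulation, so the following are assumptions rather than the authors' method: each side is summarised as a vector of topic weights over the same K topics (for example, topic proportions from an LDA model fitted on the collection), the divergence is taken in the direction D(sample || collection), and a small `epsilon` smooths the collection side to avoid division by zero. The function names are hypothetical.

```python
import math


def normalize(weights):
    """Scale non-negative topic weights so that they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("topic weights must sum to a positive value")
    return [w / total for w in weights]


def relative_entropy(sample_topics, collection_topics, epsilon=1e-12):
    """Kullback-Leibler divergence D(sample || collection), in bits.

    Both arguments are topic-weight vectors over the same topics.
    The direction and the smoothing constant are assumptions made
    for this illustration, not details taken from the paper.
    """
    if len(sample_topics) != len(collection_topics):
        raise ValueError("both vectors must cover the same topics")
    p = normalize(sample_topics)
    # Smooth the reference distribution so that no q_i is exactly zero.
    q = normalize([w + epsilon for w in collection_topics])
    # Terms with p_i == 0 contribute nothing, by the usual convention.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    # Hypothetical topic proportions, for illustration only.
    collection = [0.40, 0.30, 0.20, 0.10]
    good_sample = [0.38, 0.32, 0.19, 0.11]
    skewed_sample = [0.85, 0.10, 0.05, 0.00]
    print(relative_entropy(good_sample, collection))
    print(relative_entropy(skewed_sample, collection))
```

Under these assumptions, a value near zero indicates a sample whose topical mix is close to that of the collection, while larger values indicate a sample that over- or under-represents some themes.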
Pages: 485 / +
Page count: 2