Web Archive Profiling Through Fulltext Search

Cited by: 5
Authors
Alam, Sawood [1]
Nelson, Michael L. [1]
Van de Sompel, Herbert [2]
Rosenthal, David S. H. [3]
Affiliations
[1] Old Dominion Univ, Dept Comp Sci, Norfolk, VA 23529 USA
[2] Los Alamos Natl Lab, Los Alamos, NM USA
[3] Stanford Univ Libraries, Stanford, CA USA
Source
RESEARCH AND ADVANCED TECHNOLOGY FOR DIGITAL LIBRARIES, TPDL 2016 | 2016 / Vol. 9819
Funding
National Science Foundation (USA);
Keywords
Web archive; Memento; Archive profiling; Random searcher; ENGINES;
DOI
10.1007/978-3-319-43997-6_10
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
An archive profile is a high-level summary of a web archive's holdings that can be used to route Memento queries to the appropriate archives. A profile can be created by summarizing the CDX files (the indexes of web archives), an approach we explored in earlier work. However, requiring archives to update their profiles periodically is difficult. Alternative means of discovering the holdings of an archive rely on sampling: either issuing fulltext keyword searches to learn the URIs present in the responses, or looking up a sample set of URIs to see which of them are present in the archive. This paper covers the fulltext-search-based discovery and profiling. We developed the Random Searcher Model (RSM) to discover the holdings of an archive through a random search walk. We measured the search cost of discovering certain percentages of the archive holdings for various profiling policies under different RSM configurations. We can route 80% of the requests correctly while maintaining a recall of about 0.9 by discovering only 10% of the archive holdings and generating a profile that costs less than 1% of the complete-knowledge profile.
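The random search walk described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the in-memory `archive` dictionary, the `search` helper, the `random_search_walk` function, and the policy of drawing the next query word uniformly from the words seen in earlier results. The abstract does not spell out the actual RSM configurations or profiling policies.

```python
import random


def search(archive, word):
    """Toy fulltext search (hypothetical): URIs whose text contains the word."""
    return [uri for uri, text in archive.items() if word in text.split()]


def random_search_walk(archive, seed_word, target_fraction, max_queries=1000, rng=None):
    """Discover URIs by repeatedly searching for a randomly chosen known word.

    Returns the set of discovered URIs and the number of queries issued,
    which serves as a simple stand-in for search cost.
    """
    rng = rng or random.Random(0)
    discovered = set()
    vocabulary = [seed_word]      # words available for future queries
    seen_words = {seed_word}
    queries = 0
    # The archive size is known here only because this is a simulation
    # used to measure cost; a real searcher would not have this number.
    target = target_fraction * len(archive)
    while queries < max_queries and len(discovered) < target:
        word = rng.choice(vocabulary)
        queries += 1
        for uri in search(archive, word):
            discovered.add(uri)
            # Learn new query words from the text of each result.
            for w in archive[uri].split():
                if w not in seen_words:
                    seen_words.add(w)
                    vocabulary.append(w)
    return discovered, queries


# Hypothetical toy archive: URI -> indexed text.
archive = {
    "http://example.com/a": "web archive memento",
    "http://example.com/b": "memento routing profile",
    "http://example.org/c": "profile search cost",
    "http://example.net/d": "unrelated words only",
}

discovered, queries = random_search_walk(archive, "archive", target_fraction=0.5)
print(sorted(discovered), queries)
```

The discovered URIs could then be summarized into a profile. In this toy archive the last document shares no words with the others, so a walk seeded with "archive" can never reach it, which shows why sampling-based discovery yields partial coverage.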
Pages: 121-132
Page count: 12
Related papers
19 in total
[1]   Alam S., 2015, OBJECT RESO IN PRESS
[2]   Web Archive Profiling Through CDX Summarization [J].
Alam, Sawood ;
Nelson, Michael L. ;
Van de Sompel, Herbert ;
Balakireva, Lyudmila L. ;
Shankar, Harihar ;
Rosenthal, David S. H. .
RESEARCH AND ADVANCED TECHNOLOGY FOR DIGITAL LIBRARIES, 2015, 9316 :3-14
[3]   Profiling web archive coverage for top-level domain and content language [J].
Alsum, Ahmed ;
Weigle, Michele C. ;
Nelson, Michael L. ;
Van de Sompel, Herbert .
INTERNATIONAL JOURNAL ON DIGITAL LIBRARIES, 2014, 14 (3-4) :149-166
[4]   [Anonymous], P 21 ANN INT ACM SIG
[5]   Blum A, 2006, SIAM PROC S, P238
[6]   Bornand N., 2016, P 16 ACM IEEE CS JOI
[7]   Chang C.-C. K., 1997, SIGMOD REC, V26, P207
[8]   Untangling Herdan's law and Heaps' law: Mathematical and informetric arguments [J].
Egghe, Leo .
JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, 2007, 58 (05) :702-709
[9]   LEVY AY, 1996, QUERYING HETEROGENEO
[10]   Query routing in large-scale digital library systems [J].
Liu, L .
15TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING, PROCEEDINGS, 1999, :154-163