Profiling web archive coverage for top-level domain and content language

Cited by: 27
Authors
Alsum, Ahmed [1]
Weigle, Michele C. [2]
Nelson, Michael L.
Van de Sompel, Herbert [3]
Affiliations
[1] Stanford Univ Libraries, Stanford, CA 94305 USA
[2] Old Dominion Univ, Dept Comp Sci, Norfolk, VA 23529 USA
[3] Los Alamos Natl Lab, Los Alamos, NM 87545 USA
Funding
National Science Foundation (USA);
Keywords
Web archive; Federated search; Memento Aggregator;
DOI
10.1007/s00799-014-0118-y
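Chinese Library Classification (CLC) number
G25 [Library science, librarianship]; G35 [Information science, information work];
Discipline classification code
1205 ; 120501 ;
Abstract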
The Memento Aggregator currently polls every known public web archive when serving a request for an archived web page, even though some web archives focus on only specific domains and ignore the others. Similar to query routing in distributed search, we investigate the impact on aggregated Memento TimeMaps (lists of when and where a web page was archived) of sending queries only to archives likely to hold the archived page. We profile fifteen public web archives using data from a variety of sources (the web, archives' access logs, and full-text queries to archives) and use these profiles as resource descriptors. These profiles are used to match URI-lookup requests to the most probable web archives. We define Recall_TM(n) as the percentage of a TimeMap that was returned using n web archives. We discover that sending queries to only the top three web archives (i.e., an 80% reduction in the number of queries) for any request reaches on average Recall_TM = 0.96. If we exclude the Internet Archive from the list, we can reach Recall_TM = 0.647 on average using only the remaining top three web archives.
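The Recall_TM(n) measure defined in the abstract can be illustrated with a minimal sketch. It assumes a TimeMap is reduced to a list naming the archive that holds each memento, and that a ranking of archives has already been produced from the profiles. The archive labels, the sample TimeMap, and the function names are hypothetical and do not come from the paper.

```python
def recall_tm(memento_archives, ranked_archives, n):
    """Fraction of a TimeMap's mementos held by the top-n ranked archives.

    memento_archives: one archive label per memento in the full TimeMap.
    ranked_archives: archive labels ordered from most to least probable.
    """
    if not memento_archives:
        return 0.0
    selected = set(ranked_archives[:n])
    hits = sum(1 for archive in memento_archives if archive in selected)
    return hits / len(memento_archives)


def query_reduction(total_archives, n):
    """Share of queries saved by polling n archives instead of all of them."""
    return 1 - n / total_archives


# Hypothetical TimeMap with ten mementos spread over three archives.
timemap = ["archive_a"] * 8 + ["archive_b", "archive_c"]
# Hypothetical profile-based ranking for this URI.
ranking = ["archive_a", "archive_b", "archive_c", "archive_d"]

top1 = recall_tm(timemap, ranking, 1)
top3 = recall_tm(timemap, ranking, 3)
saved = query_reduction(15, 3)
```

With fifteen archives, polling only three saves 12 of 15 queries, which matches the 80% reduction stated in the abstract. The recall values produced here reflect only the invented sample data, not the paper's results.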
Pages: 149-166
Number of pages: 18