Profiling web archive coverage for top-level domain and content language

Cited by: 27
Authors
Alsum, Ahmed [1]
Weigle, Michele C. [2]
Nelson, Michael L.
Van de Sompel, Herbert [3]
Affiliations
[1] Stanford Univ Libraries, Stanford, CA 94305 USA
[2] Old Dominion Univ, Dept Comp Sci, Norfolk, VA 23529 USA
[3] Los Alamos Natl Lab, Los Alamos, NM 87545 USA
Funding
National Science Foundation (USA);
Keywords
Web archive; Federated search; Memento Aggregator;
DOI
10.1007/s00799-014-0118-y
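Chinese Library Classification
G25 [Library science, librarianship]; G35 [Information science, information work];
Discipline classification code
1205; 120501;
Abstract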
The Memento Aggregator currently polls every known public web archive when serving a request for an archived web page, even though some web archives focus on specific domains only and ignore the others. Similar to query routing in distributed search, we investigate the impact on aggregated Memento TimeMaps (lists of when and where a web page was archived) of sending queries only to archives likely to hold the archived page. We profile fifteen public web archives using data from a variety of sources (the web, archives' access logs, and full-text queries to archives) and use these profiles as resource descriptors. These profiles are used to match URI-lookup requests to the most probable web archives. We define Recall_TM(n) as the percentage of a TimeMap that was returned using n web archives. We discover that sending queries to only the top three web archives (i.e., an 80% reduction in the number of queries) for any request reaches on average Recall_TM = 0.96. If we exclude the Internet Archive from the list, we can reach Recall_TM = 0.647 on average using only the remaining top three web archives.
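The recall measure defined in the abstract can be illustrated with a short sketch. This is not the authors' implementation: the function names, archive names, and memento counts below are hypothetical, and the ranking is simply assumed to come from some archive profile. It shows only how the fraction of a TimeMap recovered from the top n archives, and the corresponding reduction in queries, could be computed.

```python
from typing import Dict, List


def recall_tm(memento_counts: Dict[str, int], ranked_archives: List[str], n: int) -> float:
    """Fraction of the full TimeMap recovered by querying only the top-n archives.

    memento_counts: mementos each archive holds for one URI (hypothetical data).
    ranked_archives: archives ordered by a profile, most probable first (assumed given).
    """
    total = sum(memento_counts.values())
    if total == 0:
        return 0.0  # an empty TimeMap has nothing to recall
    found = sum(memento_counts.get(name, 0) for name in ranked_archives[:n])
    return found / total


def query_reduction(total_archives: int, n: int) -> float:
    """Share of queries saved by polling n archives instead of all of them."""
    return (total_archives - n) / total_archives


# Hypothetical memento counts for a single URI; not taken from the paper.
counts = {"archive_a": 80, "archive_b": 12, "archive_c": 4, "archive_d": 3, "archive_e": 1}
ranking = ["archive_a", "archive_b", "archive_c", "archive_d", "archive_e"]

print(recall_tm(counts, ranking, 3))   # 96 of 100 mementos recovered
print(query_reduction(15, 3))          # 3 of 15 archives polled
```

With fifteen archives and n = 3, the reduction works out to 12/15, which matches the 80% figure quoted in the abstract.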
Pages: 149-166
Number of pages: 18