Web archive profiling through CDX summarization

Cited by: 11
Authors
Alam, Sawood [1]
Nelson, Michael L. [1]
Van de Sompel, Herbert [2]
Balakireva, Lyudmila L. [2]
Shankar, Harihar [2]
Rosenthal, David S. H. [3]
Affiliations
[1] Old Dominion Univ, Dept Comp Sci, Norfolk, VA 23529 USA
[2] Los Alamos Natl Lab, Los Alamos, NM USA
[3] Stanford Univ Libraries, Stanford, CA USA
Funding
U.S. National Science Foundation
Keywords
Web archives; Profiling; CDX files; Memento; Query routing
DOI
10.1007/s00799-016-0184-4
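Chinese Library Classification
G25 [Library science, librarianship]; G35 [Information science, information work]
Discipline classification code
1205; 120501
Abstract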
With the proliferation of public web archives, it is becoming more important to profile their contents, both to understand their immense holdings and to support routing of requests in the Memento aggregator. To save time, the Memento aggregator should poll only the archives that are likely to have a copy of the requested URI. Using the crawler index (CDX) files produced after crawling, we can generate profiles of the archives that summarize their holdings and can be used to inform routing of the Memento aggregator's URI requests. Previous work in profiling ranged from using full URIs (no false positives, but large profiles) to using only top-level domains (TLDs) (smaller profiles, but many false positives). This work explores strategies between these two extremes. In our experiments, we correctly identified about 78% of the URIs that were present or not present in the archive at less than 1% relative cost compared to the complete knowledge profile, and 94% of the URIs at less than 10% relative cost, without any false negatives. Relative to the TLD-only profile, the registered domain profile doubled the routing precision, while the complete hostname plus one path segment gave a tenfold increase in routing precision.
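The range of profiling policies described in the abstract, from TLD-only to hostname plus path segments, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`profile_key`, `build_profile`, `should_poll`), the parameters `host_segments` and `path_segments`, and the SURT-like key layout are all hypothetical choices made for illustration. The sketch also approximates the registered domain by taking the first two reversed host labels, which is wrong for suffixes such as `co.uk`; a real profiler would consult the Public Suffix List.

```python
from urllib.parse import urlsplit


def profile_key(uri, host_segments=None, path_segments=0):
    """Reduce a URI to a coarse lookup key (hypothetical, SURT-like layout).

    host_segments: number of reversed host labels to keep (1 = TLD only,
                   2 = roughly the registered domain, None = full hostname).
    path_segments: number of leading path segments to keep.
    """
    parts = urlsplit(uri if "://" in uri else "http://" + uri)
    host = (parts.hostname or "").lower()
    labels = host.split(".")[::-1]  # reversed, so the TLD comes first
    if host_segments is not None:
        labels = labels[:host_segments]
    key = ",".join(labels) + ")/"
    if path_segments > 0:
        segments = [s for s in parts.path.split("/") if s][:path_segments]
        key += "/".join(segments)
    return key


def build_profile(uris, **policy):
    """Summarize an archive's holdings as a key -> URI count mapping."""
    counts = {}
    for uri in uris:
        key = profile_key(uri, **policy)
        counts[key] = counts.get(key, 0) + 1
    return counts


def should_poll(profile, uri, **policy):
    """Return True if the profile suggests the archive may hold this URI."""
    return profile_key(uri, **policy) in profile


# Illustrative holdings (made-up URIs, not data from the paper).
holdings = ["http://example.com/x", "http://news.example.org/y"]
profile = build_profile(holdings, host_segments=2)
```

Because every held URI contributes its own key to the profile, a lookup under the same policy cannot miss a URI the archive holds, so the scheme produces no false negatives. Coarser keys shrink the profile but admit more false positives: with `host_segments=2`, `should_poll(profile, "http://www.example.com/z", host_segments=2)` returns `True` even though that exact URI is not among the holdings.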
Pages: 223-238
Page count: 16