Web Archive Profiling Through CDX Summarization

Cited by: 9
Authors
Alam, Sawood [1]
Nelson, Michael L. [1]
Van de Sompel, Herbert [2]
Balakireva, Lyudmila L. [2]
Shankar, Harihar [2]
Rosenthal, David S. H. [3]
Affiliations
[1] Old Dominion Univ, Dept Comp Sci, Norfolk, VA 23529 USA
[2] Los Alamos Natl Lab, Los Alamos, NM USA
[3] Stanford Univ Libraries, Stanford, CA USA
Source
RESEARCH AND ADVANCED TECHNOLOGY FOR DIGITAL LIBRARIES | 2015 / Vol. 9316
Keywords
Web archives; Profiling; CDX Files; Memento;
DOI
10.1007/978-3-319-24592-8_1
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
With the proliferation of public web archives, it is becoming more important to better profile their contents, both to understand their immense holdings and to support routing of requests in the Memento aggregator. To save time, the Memento aggregator should only poll the archives that are likely to have a copy of the requested URI. Using the CDX files produced after crawling, we can generate profiles of the archives that summarize their holdings and can be used to inform routing of the Memento aggregator's URI requests. Previous work in profiling ranged from using full URIs (no false positives, but with large profiles) to using only top-level domains (TLDs) (smaller profiles, but with many false positives). This work explores strategies in between these two extremes. In our experiments, we gained up to 22% routing precision, without any false negatives, at less than 5% relative cost compared to the complete-knowledge profile. With respect to the TLD-only profile, the registered-domain profile doubled the routing precision, while the profile using the complete hostname and one path segment gave a fivefold increase in routing precision.
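The range of profiling strategies described in the abstract, from TLD-only up to hostname plus path segments, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `profile_key`, `build_profile`, and `should_poll`, the comma-separated reversed-host key layout, and the example URIs are all assumptions made for illustration. Counting host labels is also only a naive stand-in for a true registered-domain lookup, which would need a public suffix list.

```python
from urllib.parse import urlsplit


def profile_key(uri, host_segments=None, path_segments=0):
    """Build a reversed-host key keeping a limited number of segments.

    host_segments=None keeps the complete hostname (hypothetical convention).
    """
    parts = urlsplit(uri if "://" in uri else "http://" + uri)
    host = (parts.hostname or "").lower()
    labels = host.split(".")[::-1]  # TLD first, e.g. ["com", "example", "news"]
    if host_segments is not None:
        labels = labels[:host_segments]
    key = ",".join(labels) + ")/"
    if path_segments > 0:
        segments = [s for s in parts.path.split("/") if s][:path_segments]
        key += "/".join(segments)
    return key


def build_profile(uris, **policy):
    """Summarize an archive's holdings as a set of keys under one policy."""
    return {profile_key(u, **policy) for u in uris}


def should_poll(profile, uri, **policy):
    """Route a request to the archive only if the lookup key is in its profile."""
    return profile_key(uri, **policy) in profile


# Hypothetical holdings of one archive.
holdings = ["http://news.example.com/a/b", "http://example.org/x"]
tld_profile = build_profile(holdings, host_segments=1)
host_path_profile = build_profile(holdings, path_segments=1)
```

Under these assumptions, a request for `http://other.com/zzz` matches the TLD-only profile (a false positive) but not the hostname-plus-one-path-segment profile. Any URI the archive actually holds maps to a key in either profile, so neither policy introduces false negatives.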
Pages: 3-14
Page count: 12
Related papers
13 in total
  • [1] Alam Sawood, 2014, TECHNICAL REPORT
  • [2] Who and what links to the Internet Archive
    AlNoamany, Yasmin
    Alsum, Ahmed
    Weigle, Michele C.
    Nelson, Michael L.
    [J]. INTERNATIONAL JOURNAL ON DIGITAL LIBRARIES, 2014, 14 (3-4) : 101 - 115
  • [3] Alsum Ahmed, 2013, Research and Advanced Technology for Digital Libraries. International Conference on Theory and Practice of Digital Libraries, TPDL 2013. Proceedings: LNCS 8092, P60, DOI 10.1007/978-3-642-40501-3_7
  • [4] Profiling web archive coverage for top-level domain and content language
    Alsum, Ahmed
    Weigle, Michele C.
    Nelson, Michael L.
    Van de Sompel, Herbert
    [J]. INTERNATIONAL JOURNAL ON DIGITAL LIBRARIES, 2014, 14 (3-4) : 149 - 166
  • [5] [Anonymous], 2009, 28500 ISO
  • [6] Crockford D., 2006, 4627 RFC
  • [7] Untangling Herdan's law and Heaps' law: Mathematical and informetric arguments
    Egghe, Leo
    [J]. JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, 2007, 58 (05) : 702 - 709
  • [8] Gailly Jean-Loup, 2013, GZIP FILE FORMAT
  • [9] Sanderson R., 2012, IIPC MEMENTO AGGREGA
  • [10] Sanderson R., 2012, P 12 ACMIEEE CS JOIN, P379