MementoMap Framework for Flexible and Adaptive Web Archive Profiling

Cited by: 9
Authors
Alam, Sawood [1]
Weigle, Michele C. [1]
Nelson, Michael L. [1]
Melo, Fernando [2]
Bicho, Daniel [2]
Gomes, Daniel [2]
Affiliations
[1] Old Dominion Univ, Dept Comp Sci, Norfolk, VA 23529 USA
[2] FCT Arquivopt, Lisbon, Portugal
Source
2019 ACM/IEEE JOINT CONFERENCE ON DIGITAL LIBRARIES (JCDL 2019) | 2019
Keywords
Memento; Web Archiving; Archive Profiling; MementoMap; LAW
DOI
10.1109/JCDL.2019.00033
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
In this work we proposed MementoMap, a flexible and adaptive framework to summarize the holdings of a web archive efficiently. We described a simple, yet extensible, file format suitable for MementoMap. We used the complete index of Arquivo.pt, comprising 5B mementos (archived web pages/files), to understand the nature and shape of its holdings. We generated MementoMaps with varying amounts of detail from its HTML pages that have an HTTP status code of 200 OK. Additionally, we designed a single-pass, memory-efficient, and parallelization-friendly algorithm to compact a large MementoMap into a small one, and an in-file binary search method for efficient lookup. We analyzed more than three years of MemGator (a Memento aggregator) logs to understand the response behavior of 14 public web archives. We evaluated MementoMaps by measuring their Accuracy using 3.3M unique URIs from MemGator logs. We found that a MementoMap of less than 1.5% Relative Cost (as compared to the comprehensive listing of all the unique original URIs) can correctly identify the presence or absence of 60% of the lookup URIs in the corresponding archive while maintaining 100% Recall (i.e., zero false negatives).
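The evaluation terms in the abstract (Accuracy, Recall, Relative Cost) can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names, the pair-based input, and the sample counts are hypothetical, and Relative Cost is assumed here to be the ratio of summary entries to unique original URIs.

```python
def evaluate(outcomes):
    """Return (accuracy, recall) for a list of (predicted, actual) booleans.

    predicted: the summary says the URI is present in the archive.
    actual:    the archive really holds the URI.
    """
    tp = sum(1 for p, a in outcomes if p and a)          # correctly reported present
    tn = sum(1 for p, a in outcomes if not p and not a)  # correctly reported absent
    fn = sum(1 for p, a in outcomes if not p and a)      # missed holdings
    total = len(outcomes)
    accuracy = (tp + tn) / total if total else 0.0
    # With no actual positives there is nothing to miss, so recall is taken as 1.0.
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return accuracy, recall


def relative_cost(summary_entries, unique_uris):
    """Assumed definition: summary size relative to the full list of unique URIs."""
    return summary_entries / unique_uris


# Illustrative data only: 2 true positives, 4 true negatives,
# 4 false positives, and no false negatives.
sample = ([(True, True)] * 2 + [(False, False)] * 4 + [(True, False)] * 4)
accuracy, recall = evaluate(sample)
cost = relative_cost(15, 1000)
```

In this made-up sample the summary is right about 6 of 10 lookups while never missing a URI that the archive actually holds, which is the trade-off the abstract describes: false positives lower Accuracy, but Recall stays at 100% as long as there are no false negatives.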
Pages: 172-181
Page count: 10
Related papers
37 in total
  • [1] al Masri R., 2017, JORDAN INVISIBLE HAN
  • [2] Alam S., 2016, P 16 ACM IEEE CS JOI
  • [3] Alam S., 2015, MEMGATOR MEMENTO AGG
  • [4] Web Archive Profiling Through Fulltext Search
    Alam, Sawood
    Nelson, Michael L.
    Van de Sompel, Herbert
    Rosenthal, David S. H.
    [J]. RESEARCH AND ADVANCED TECHNOLOGY FOR DIGITAL LIBRARIES, TPDL 2016, 2016, 9819 : 121 - 132
  • [5] Web archive profiling through CDX summarization
    Alam, Sawood
    Nelson, Michael L.
    Van de Sompel, Herbert
    Balakireva, Lyudmila L.
    Shankar, Harihar
    Rosenthal, David S. H.
    [J]. INTERNATIONAL JOURNAL ON DIGITAL LIBRARIES, 2016, 17 (03) : 223 - 238
  • [6] Web Archive Profiling Through CDX Summarization
    Alam, Sawood
    Nelson, Michael L.
    Van de Sompel, Herbert
    Balakireva, Lyudmila L.
    Shankar, Harihar
    Rosenthal, David S. H.
    [J]. RESEARCH AND ADVANCED TECHNOLOGY FOR DIGITAL LIBRARIES, 2015, 9316 : 3 - 14
  • [7] Alam Sawood., 2019, UNIFIED KEY VALUE ST
  • [8] Alam Sawood., 2019, MEMENTOMAP TOOL SUMM
  • [9] Alam Sawood, 2014, TECHNICAL REPORT
  • [10] Alam Sawood., 2015, OBJECT RESOURCE STRE