Nelson, Michael L. [1]
Van de Sompel, Herbert [2]
Balakireva, Lyudmila L. [2]
Shankar, Harihar [2]
Rosenthal, David S. H. [3]
Affiliations:
[1] Old Dominion Univ, Dept Comp Sci, Norfolk, VA 23529 USA
[2] Los Alamos Natl Lab, Los Alamos, NM USA
[3] Stanford Univ Libraries, Stanford, CA USA
Source:
RESEARCH AND ADVANCED TECHNOLOGY FOR DIGITAL LIBRARIES | 2015 / Vol. 9316
Keywords:
Web archives;
Profiling;
CDX Files;
Memento;
DOI:
10.1007/978-3-319-24592-8_1
Chinese Library Classification (CLC) number:
TP [Automation technology, computer technology];
Discipline classification code:
0812;
Abstract:
With the proliferation of public web archives, it is becoming more important to profile their contents, both to understand their immense holdings and to support routing of requests in the Memento aggregator. To save time, the Memento aggregator should poll only the archives that are likely to have a copy of the requested URI. Using the CDX files produced after crawling, we can generate profiles of the archives that summarize their holdings and can be used to inform routing of the Memento aggregator's URI requests. Previous work in profiling ranged from using full URIs (no false positives, but large profiles) to using only top-level domains (TLDs) (smaller profiles, but many false positives). This work explores strategies between these two extremes. In our experiments, we gained up to 22% routing precision at less than 5% relative cost compared to the complete knowledge profile, without any false negatives. With respect to the TLD-only profile, the registered domain profile doubled the routing precision, while the complete hostname plus one path segment gave a fivefold increase in routing precision.
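
The trade-off described in the abstract, between coarse profile keys (small profiles, many false positives) and fine ones (larger profiles, fewer false positives), can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `profile_key`, `build_profile`, and `should_poll`, the parameters `max_host_labels` and `max_path_segments`, and the reversed-host key layout are all assumptions made for illustration. In particular, using two host labels as a stand-in for the registered domain is a simplification that ignores multi-label public suffixes such as `co.uk`.

```python
from urllib.parse import urlsplit


def profile_key(uri, max_host_labels=None, max_path_segments=0):
    """Reduce an absolute URI to a coarser key (hypothetical layout).

    The key is the reversed host labels, optionally truncated, followed by
    the first few path segments, e.g. "com,example)/news".
    """
    parts = urlsplit(uri)
    host = (parts.hostname or "").lower()
    labels = host.split(".")[::-1]  # "www.example.com" -> ["com", "example", "www"]
    if max_host_labels is not None:
        labels = labels[:max_host_labels]
    # Keep only non-empty path segments, then truncate to the requested depth.
    segments = [s for s in parts.path.split("/") if s][:max_path_segments]
    return ",".join(labels) + ")/" + "/".join(segments)


def build_profile(uris, **policy):
    """Summarize an archive's holdings as the set of keys under one policy."""
    return {profile_key(u, **policy) for u in uris}


def should_poll(profile, uri, **policy):
    """Route a request to the archive only if the lookup key is in its profile."""
    return profile_key(uri, **policy) in profile


# Assumed example holdings; in practice these would come from CDX records.
holdings = ["http://www.example.com/news/2015/a.html"]

tld_only = {"max_host_labels": 1}                 # TLD-only policy
two_labels = {"max_host_labels": 2}               # naive registered-domain stand-in
host_plus_path = {"max_path_segments": 1}         # full hostname + one path segment

tld_profile = build_profile(holdings, **tld_only)
domain_profile = build_profile(holdings, **two_labels)
path_profile = build_profile(holdings, **host_plus_path)
```

Under these assumptions, a request for `http://other.com/x` matches the TLD-only profile (a false positive) but not the two-label profile, while any URI actually held always matches under every policy, so truncation alone never introduces false negatives.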