Gone, Gone, but Not Really, and Gone, But Not forgotten: A Typology of Website Recoverability

Times cited: 0
Authors
Ayala, Brenda Reyes [1]
Affiliations
[1] Univ Alberta, Edmonton, AB, Canada
Keywords
web archives; web archiving; content drift; link rot; reference rot; lost websites
DOI
10.1145/3543873.3587671
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper presents a qualitative analysis of the recoverability of various webpages on the live web, using their archived counterparts as a baseline. We used a heterogeneous dataset consisting of four web archive collections, each with varying degrees of content drift. We were able to recover a small number of webpages previously thought to have been lost and analyzed their content and evolution. Our analysis yielded three types of lost webpages: 1) those that are not recoverable (with three subtypes), 2) those that are fully recoverable, and 3) those that are partially recoverable. The analysis presented here attempts to establish clear definitions and boundaries between the different degrees of webpage recoverability. By using a few simple methods, web archivists could discover the new locations of web content previously deemed lost and include them in future crawling efforts, leading to more complete web archives with less content drift.
Pages: 1208-1213
Number of pages: 6