Archival Crawlers and JavaScript: Discover More Stuff but Crawl More Slowly

Cited by: 0
Authors
Brunelle, Justin F. [1, 2]
Weigle, Michele C. [2 ]
Nelson, Michael L. [2 ]
Affiliations
[1] Mitre Corp, 7525 Colshire Dr, McLean, VA 22101 USA
[2] Old Dominion Univ, Dept Comp Sci, Norfolk, VA 23529 USA
Source
2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL 2017) | 2017
Funding
National Science Foundation (USA)
Keywords
Web Archiving; Digital Preservation; Memento; Web Crawling;
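DOI
Not available
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract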
The web is today's primary publication medium, making web archiving an important activity for historical and analytical purposes. Web pages are increasingly interactive, resulting in pages that are correspondingly difficult to archive. JavaScript enables interactions that can potentially change the client-side state of a representation. We refer to representations that load embedded resources via JavaScript as deferred representations. It is difficult to discover and crawl all of the resources in deferred representations, and archiving them yields archived web pages that are either incomplete or erroneously load embedded resources from the live web. We propose a method of discovering and archiving deferred representations and their descendants (representation states) that are only reachable through client-side events. Our approach identified an average of 38.5 descendants per seed URI crawled, 70.9% of which are reached through an onclick event. This approach also added 15.6 times more embedded resources than Heritrix to the crawl frontier, but at a crawl rate that was 38.9 times slower than simply using Heritrix. If our method were applied to the July 2015 Common Crawl dataset, a web-scale archival crawler would discover an additional 7.17 PB (5.12 times more) of information per year. This illustrates the significant increase in resources necessary for more thorough archival crawls.
Pages: 1-10
Page count: 10