bit-tech.net

Internet Archive announces broader crawler scope

The Internet Archive has announced that it is to begin ignoring the robots.txt directives file, meaning it will start to capture and archive entire websites whether webmasters want it to or not.

The Internet Archive has made the controversial decision to begin ignoring robots.txt, the file websites use to ask automated content crawlers to stay away from selected content, in an effort to increase its coverage.

Despite being entirely funded by donations and troubled by the occasional fire, the Internet Archive is making considerable inroads into its self-appointed task of creating a publicly accessible archive of everything it can get its hands on. In the last few years the organisation has launched in-browser vintage computing emulation, playable classic arcade games, a museum of de-fanged malware, and an Amiga software library, and published a trove of internal documents from interactive fiction pioneer Infocom, to say nothing of its archive of vintage computing magazines.

Its most popular feature, however, is the Wayback Machine, a service which allows users to enter a URL and view any copies the Internet Archive's robots have captured over time. A fantastic resource both for research and for preserving information which would otherwise be lost to history, the Wayback Machine has until now respected the robots.txt directives that allow webmasters to lock automated content crawlers out of chosen files and directories, but it will do so no longer.

'Over time we have observed that the robots.txt files that are geared toward search engine crawlers do not necessarily serve our archival purposes. Internet Archive’s goal is to create complete “snapshots” of web pages, including the duplicate content and the large versions of files,' explained Mark Graham in a blog post announcing the change. 'We have also seen an upsurge of the use of robots.txt files to remove entire domains from search engines when they transition from a live web site into a parked domain, which has historically also removed the entire domain from view in the Wayback Machine. In other words, a site goes out of business and then the parked domain is “blocked” from search engines and no one can look at the history of that site in the Wayback Machine anymore. We receive inquiries and complaints on these “disappeared” sites almost daily.'

'A few months ago we stopped referring to robots.txt files on U.S. government and military web sites for both crawling and displaying web pages (though we respond to removal requests sent to info@archive.org). As we have moved towards broader access it has not caused problems, which we take as a good sign. We are now looking to do this more broadly. We see the future of web archiving relying less on robots.txt file declarations geared toward search engines, and more on representing the web as it really was, and is, from a user’s perspective.'

While the shift will give the Internet Archive access to a wider range of content and more control over what remains within its archives, it does so by taking that control away from webmasters, a move which has proven controversial, not least because ignoring the directives file altogether also overrides sites which wish to block Internet Archive access specifically via its ia_archiver user agent.
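For illustration, the sort of robots.txt rules at issue look something like the sketch below; the directory names are hypothetical examples, while ia_archiver is the user agent the Internet Archive's crawler identifies itself by.

    # Ask only the Internet Archive's crawler to stay away from the whole site
    User-agent: ia_archiver
    Disallow: /

    # Ask all crawlers to avoid particular directories (hypothetical paths)
    User-agent: *
    Disallow: /private/
    Disallow: /drafts/

Whether a crawler honours these rules is entirely up to the crawler, which is why the Archive's change is a matter of policy rather than the circumvention of any technical barrier.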

4 Comments

jb0 25th April 2017, 12:13
Quote:
While the shift will give the Internet Archive access to a wider range of content and more control over what remains within its archives, it does so by taking that control away from webmasters, a move which has proven controversial, not least because ignoring the directives file altogether also overrides sites which wish to block Internet Archive access specifically via its ia_archiver user agent.
Webmasters never HAD control. Robots.txt was not an enforceable access control mechanism, merely a polite request. It worked only because most search engines agreed to abide by said requests. Apparently, some of the Chinese ones use robots.txt exactly backwards, as a map of which parts of the site to crawl FIRST. Which is, you know, the obvious first thing to do once people start acting like a polite request is a real access control mechanism.

Most of the controversy is from people who simply don't understand the difference and think the Internet Archive is somehow hacking every server on Earth to bypass the robots.txt firewall. It isn't even setting a bad precedent, since they're far from the first major bot to ignore robots.txt (or use it as a sitemap).

...

Tangentially, have they ever explained why they honor the CURRENT robots.txt file when trying to view previously-stored content? It always seemed to me that they should honor the robots.txt in effect at the time the site was saved, if anything.
Gareth Halfacree 25th April 2017, 14:37
Quote:
Originally Posted by jb0
Tangentially, have they ever explained why they honor the CURRENT robots.txt file when trying to view previously-stored content? It always seemed to me that they should honor the robots.txt in effect at the time the site was saved, if anything.
I think it started life as an "if you didn't want this archived, we'll politely take it down" policy, and has since been replaced by the proper "you archived something I didn't want you to archive, please take it down" email address and/or DMCA notifications.
mi1ez 25th April 2017, 23:02
I would think this is essential functionality? Wouldn't most robots.txt files prevent crawlers from hitting scripts, styles, etc.? Wouldn't those files be pretty useful in rendering the pages back at a later date?
Wwhat 3rd May 2017, 01:34
Ever since the big companies and the politicians found out about the archive and started forcing it to remove all kinds of stuff, it really hasn't been the same.