bit-tech.net

Internet Archive announces broader crawler scope

The Internet Archive has announced that it is to begin ignoring the robots.txt directives file, meaning it will start to capture and archive entire websites whether webmasters want it to or not.

The Internet Archive has made the controversial decision to begin ignoring robots.txt, the file websites use to ask automated content crawlers to stay away from selected content, in an effort to increase its coverage.

Despite being entirely funded by donations and troubled by the occasional fire, the Internet Archive is making considerable inroads into its self-appointed task of creating a publicly accessible archive of everything it can get its hands on. In the last few years the organisation has launched in-browser vintage computing emulation, playable classic arcade games, a museum of de-fanged malware, and an Amiga software library, and published a trove of internal documents from interactive fiction pioneer Infocom, to say nothing of its archive of vintage computing magazines.

Its most popular feature, however, is the Wayback Machine, a service which allows users to enter a URL and view any copies the Internet Archive's robots have captured over time. A fantastic resource both for research and for preserving information which would otherwise be lost to history, the Wayback Machine has until now respected the robots.txt directives that allow webmasters to lock automated content crawlers out of chosen files and directories, but it will do so no longer.

'Over time we have observed that the robots.txt files that are geared toward search engine crawlers do not necessarily serve our archival purposes. Internet Archive’s goal is to create complete “snapshots” of web pages, including the duplicate content and the large versions of files,' explained Mark Graham in a blog post announcing the change. 'We have also seen an upsurge of the use of robots.txt files to remove entire domains from search engines when they transition from a live web site into a parked domain, which has historically also removed the entire domain from view in the Wayback Machine. In other words, a site goes out of business and then the parked domain is “blocked” from search engines and no one can look at the history of that site in the Wayback Machine anymore. We receive inquiries and complaints on these “disappeared” sites almost daily.'

'A few months ago we stopped referring to robots.txt files on U.S. government and military web sites for both crawling and displaying web pages (though we respond to removal requests sent to info@archive.org). As we have moved towards broader access it has not caused problems, which we take as a good sign. We are now looking to do this more broadly. We see the future of web archiving relying less on robots.txt file declarations geared toward search engines, and more on representing the web as it really was, and is, from a user’s perspective.'

While the shift will give the Internet Archive access to a wider range of content and more control over what remains within its archives, it does so by taking that control away from webmasters, a move which has proven controversial, not least because ignoring the directives file altogether also overrides sites which wish to block Internet Archive access specifically via its ia_archiver user agent.
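For illustration, the sort of robots.txt rules at issue look something like the sketch below; the directory names are hypothetical examples, while ia_archiver is the user agent the Internet Archive's crawler identifies itself by.

    # Ask only the Internet Archive's crawler to stay away from the whole site
    User-agent: ia_archiver
    Disallow: /

    # Ask all crawlers to avoid particular directories (hypothetical paths)
    User-agent: *
    Disallow: /private/
    Disallow: /drafts/

Whether a crawler honours these rules is entirely up to the crawler, which is why the Archive's change is a matter of policy rather than the circumvention of any technical barrier.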

4 Comments

jb0 25th April 2017, 12:13
Quote:
While the shift will give the Internet Archive access to a wider range of content and more control over what remains within its archives, it does so by taking that control away from webmasters, a move which has proven controversial, not least because ignoring the directives file altogether also overrides sites which wish to block Internet Archive access specifically via its ia_archiver user agent.
Webmasters never HAD control. Robots.txt was not an enforceable access control mechanism, merely a polite request. It worked only because most search engines agreed to abide by said requests. Apparently, some of the Chinese ones use robots.txt exactly backwards, as a map of which parts of the site to crawl FIRST. Which is, you know, the obvious first thing to do once people start acting like a polite request is a real access control mechanism.

Most of the controversy is from people who simply don't understand the difference and think the Internet Archive is somehow hacking every server on Earth to bypass the robots.txt firewall. It isn't even setting a bad precedent, since they're far from the first major bot to ignore robots.txt (or use it as a sitemap).

...

Tangentially, have they ever explained why they honor the CURRENT robots.txt file when trying to view previously-stored content? It always seemed to me that they should honor the robots.txt in effect at the time the site was saved, if anything.
Gareth Halfacree 25th April 2017, 14:37
Quote:
Originally Posted by jb0
Tangentially, have they ever explained why they honor the CURRENT robots.txt file when trying to view previously-stored content? It always seemed to me that they should honor the robots.txt in effect at the time the site was saved, if anything.
I think it started life as an "if you didn't want this archived, we'll politely take it down" policy, and has since been replaced by the proper "you archived something I didn't want you to archive, please take it down" email address and/or DMCA notifications.
mi1ez 25th April 2017, 23:02
I would think this is essential functionality? Wouldn't most robots.txt files prevent crawlers from hitting scripts, styles, etc.? Wouldn't those files be pretty useful in rendering the pages back at a later date?
Wwhat 3rd May 2017, 01:34
Ever since the big companies and the politicians found out about the archive and started forcing it to remove all kinds of stuff, it really hasn't been the same.