On Web Archives and White Houses
Here's a fun search to try on Google sometime: "disallow inurl:robots.txt filetype:txt"
It struck me as a funny irony that the files which exist to tell search engines what not to index are themselves indexed and easily searchable. You can add keywords to that query to search for specific directory names that people don't want found, but I'll leave that to the Google hacker forums. The security implications were not what I found most interesting.
What did interest me was this:
The most popular robots.txt file at the time of this writing was the one belonging to the web site for the United States White House, http://www.whitehouse.gov/robots.txt, and it was immense. The file is easily one of the largest I have seen.
To understand why I found this troubling, I will quickly sum up what a "robots.txt" file is. Robots.txt is a file that is, by convention, placed in the main folder of a web site and provides some information to the search engines (the robots) that visit. Good manners on the part of the search companies dictate that any robots they employ should be "well-behaved", which is to say they obey the limits in robots.txt, do not overload the site with too many simultaneous queries, and so forth.
The robots.txt file itself has two main functions:
- To tell search engines where they should not go, and consequently what not to include in their indexes.
- To tell search engines that have spidered the site in the past what they should forget about.
Well-behaved search engines and archivers, such as Google or Web.Archive.org, use robots.txt as the ultimate authority on what is *not* to be found or remembered (see references). Robots.txt files with long lists of actual content cause the indicated content to vanish from these third-party sites.
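To make the mechanics concrete, here is a minimal sketch in Python of the check a well-behaved robot performs before fetching a page. It uses the standard library's urllib.robotparser. The two Disallow lines are taken from the whitehouse.gov entries quoted later in this post; the "User-agent: *" line, the robot name "ExampleBot", and the helper is_allowed are assumptions made for illustration, not a description of how Google or the Internet Archive actually implement this.

```python
from urllib.robotparser import RobotFileParser

# Two Disallow entries quoted in this post. The "User-agent: *" line is an
# assumption; the quoted excerpt does not show which robots it applies to.
ROBOTS_TXT = """\
User-agent: *
Disallow: /911/iraq
Disallow: /news/releases/2003/07/images/iraq
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "ExampleBot" is a hypothetical robot name.
print(is_allowed(ROBOTS_TXT, "ExampleBot", "http://www.whitehouse.gov/911/iraq"))
print(is_allowed(ROBOTS_TXT, "ExampleBot", "http://www.whitehouse.gov/news/"))
```

The first call reports that the listed path is off limits, while the second shows that an unlisted path remains fetchable. A robot that skips this check is, by the convention described above, badly behaved.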
I believe it is to the nation's detriment that the records of that content are purged from the third-party sites normally dedicated to making it available.
Regarding page removal, the whitehouse.gov site does clearly have this caveat on its 404 error pages:
This statement, though accurate, is misleading: it implies that any missing pages must have been from the previous administration. This is not the case. Here are a few lines from the robots.txt file which clearly indicate that more recent data has been moved or removed:
Disallow: /911/heroes/iraq
Disallow: /911/heroes/text
Disallow: /911/iraq

Disallow: /news/releases/2003/07/images/iraq
Disallow: /news/releases/2003/07/images/print/iraq
Disallow: /news/releases/2003/07/images/print/text
Disallow: /news/releases/2003/07/images/text
Both sets of lines quoted above were present in the original file on May 29th, 2005. The current administration's time in office covers both September 2001 and July 2003.
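For anyone who wants to repeat this kind of reading on a larger file, the following is a rough Python sketch that pulls the Disallow paths out of robots.txt text and extracts the year and month from any path shaped like /news/releases/YYYY/MM/. The function names and the regular expression are my own assumptions for illustration; the pattern only covers the press-release layout seen in the lines quoted above and says nothing about other directories.

```python
import re
from typing import List, Tuple

# The seven lines quoted above from the whitehouse.gov robots.txt file.
QUOTED = """\
Disallow: /911/heroes/iraq
Disallow: /911/heroes/text
Disallow: /911/iraq
Disallow: /news/releases/2003/07/images/iraq
Disallow: /news/releases/2003/07/images/print/iraq
Disallow: /news/releases/2003/07/images/print/text
Disallow: /news/releases/2003/07/images/text
"""

# Assumed layout: /news/releases/<4-digit year>/<2-digit month>/...
RELEASE_DATE = re.compile(r"^/news/releases/(\d{4})/(\d{2})/")


def disallowed_paths(robots_txt: str) -> List[str]:
    """Collect the path from every non-empty Disallow line."""
    paths = []
    for line in robots_txt.splitlines():
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


def release_months(paths: List[str]) -> List[Tuple[int, int]]:
    """Return the distinct (year, month) pairs found in release paths."""
    found = set()
    for path in paths:
        match = RELEASE_DATE.match(path)
        if match:
            found.add((int(match.group(1)), int(match.group(2))))
    return sorted(found)


paths = disallowed_paths(QUOTED)
print(len(paths))             # 7
print(release_months(paths))  # [(2003, 7)]
```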
To be fair, any web site is free to move or remove pages at any time. It is quite possible that what I am documenting here is no more than the competence of the webmasters in charge of the site.
However, modern search engines are completely capable of noticing the "404 File Not Found" error message and using that indicator to drop the missing file from their index. A robots.txt entry for a particular resource is not required in order to properly remove nonexistent pages from the search engines. From a webmaster's perspective, there is therefore no utility in maintaining such a list simply to keep the search engine indexes up to date. The only reason I see to maintain such an exhaustive list is to make sure that the listed content is forgotten.
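The 404 behavior described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not any real search engine's code: the index is modeled as a plain dictionary, prune_missing and get_status are hypothetical names, the example.com URLs are placeholders, and the status codes are supplied by a lookup table rather than real HTTP requests so the sketch runs without a network.

```python
from typing import Callable, Dict


def prune_missing(index: Dict[str, str],
                  get_status: Callable[[str], int]) -> Dict[str, str]:
    """Return a copy of index without pages that now answer 404."""
    return {url: title for url, title in index.items()
            if get_status(url) != 404}


# Hypothetical index contents and recrawl results.
index = {
    "http://www.example.com/press/a.html": "Press release A",
    "http://www.example.com/press/b.html": "Press release B",
}
statuses = {
    "http://www.example.com/press/a.html": 200,
    "http://www.example.com/press/b.html": 404,
}

pruned = prune_missing(index, statuses.__getitem__)
print(sorted(pruned))
```

Note that no robots.txt entry is consulted anywhere in this sketch: the missing page falls out of the index on the strength of the 404 response alone.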
Anyone interested in poking their fingers into the government's memory hole is invited to try the searches below.
Fun Searches:
References:
- Local mirror of http://www.whitehouse.gov/robots.txt circa May 29, 2005
- The United States White House: http://www.whitehouse.gov
- The current robots.txt file for http://www.whitehouse.gov
- The current 404 error page for http://www.whitehouse.gov
- Internet Archive: Wayback Machine - http://web.archive.org
- The Internet Archive's page exclusion policy
- Google - http://www.google.com
- Google's page removal policy