Search Engine Showdown
 
 


Indexing Robots.txt Files

It appears that Google's spider is not only checking robots.txt files, it is also indexing and even caching some of them. Try a search on allinurl:robots.txt to see some examples, or see the cached copy of the Salon.com file.

It would be interesting to know why they are doing this. Other search engines, such as AlltheWeb, will index robots.txt files that do not follow the protocol, as a search for disallow user-agent url.all:robots.txt shows. (In those results, either the robots.txt file is not located in the root directory or the filename is not all lower case.) But because Google is not only indexing the content of the files but also saving cached versions, this opens up some interesting applications: searching for sites that exclude specific bots, and tracking changes in a site's robots.txt file by comparing the cached version to the current version.
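As a rough illustration of those two applications, here is a minimal Python sketch. It assumes you have already saved the cached and current robots.txt text yourself (fetching either copy is left out), and the function names fully_blocked_agents and diff_robots, along with the bot name ExampleBot, are made up for this example. The parsing is deliberately simplified and is not a full implementation of the robots exclusion protocol.

```python
import difflib


def fully_blocked_agents(robots_txt):
    """Return the user-agents whose group contains a 'Disallow: /' rule.

    Simplified parsing: a group is one or more User-agent lines followed
    by rule lines; a User-agent line after a rule starts a new group.
    """
    blocked = []
    group = []        # user-agents named by the current group
    in_rules = False  # True once the current group has seen a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a rule came before this line, so a new group starts
                group, in_rules = [], False
            group.append(value)
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/":
                blocked.extend(a for a in group if a not in blocked)
    return blocked


def diff_robots(cached, current):
    """Return unified-diff lines between a cached and a current robots.txt."""
    return list(difflib.unified_diff(
        cached.splitlines(), current.splitlines(),
        fromfile="cached", tofile="current", lineterm=""))


if __name__ == "__main__":
    # Hypothetical file contents, for illustration only.
    cached = "User-agent: *\nDisallow: /private/\n"
    current = cached + "\nUser-agent: ExampleBot\nDisallow: /\n"
    print(fully_blocked_agents(current))
    print("\n".join(diff_robots(cached, current)))
```

Run against a real pair of files, the diff would show which rules were added or dropped between the cached copy and the live one, and the first function would list any bots shut out of the whole site.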

How long this remains available will depend on whether it was intentional on Google's part or simply a mistake. Since some of the KWIC extracts (snippets) show markup such as
<html><head></head><body><pre>
that is not actually in the original files, I suspect that it is either a mistake or a feature that still has some bugs to be worked out.

Dated Jul 9, 2003 in AlltheWeb | Archived Pages | Google

