Search Engine Showdown
 


Hunting for Google's Cache

Finding Google's cached copy of a page is not always trouble free. Take the recent example of an interesting story of journalistic confusion that has become even more confused. Apparently, a Computerworld reporter was fooled into believing that terrorists had claimed responsibility for the recent "Slammer" worm. The original story was posted online, but the page now states "Computerworld removed this story due to questions about its authenticity. An update about this situation has been posted."

So what does this have to do with Google's cache? Well, other reporters thought they might find the original story in Google's cache. Google Village, in its story "Google Everflux Misses Slammer Terror," states that "Google is good at getting the fresh stuff each day, but not good enough to capture a page, and cache it after such a page has appeared for a few hours." And The Register reports that the story "doesn't seem to have been around long enough to make it into Google cache."

Well, I beg to differ. For as long as it lasts, take a look here. Presumably, the reporters tried a search like cache:www.computerworld.com/securitytopics/security/virus/story/0,10801,78219,00.html which currently gives no results. If they had gone one step further and clicked on the "News" tab, they would have found the cached file. Note that the cached copy is missing the usual surrounding text and graphics. I think this is because of the way Google identifies news articles for indexing, which leaves out the navigational and other surrounding text. Google News search results do not display a link to a cached copy of a story, but apparently the cached copies are there anyway. And in case the cached copy disappears from Google, I have a copy on my site.
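For anyone who wants to build such a cache: query from a page address without retyping it, here is a minimal Python sketch. The function name cache_query is my own invention for illustration, not anything Google provides, and the sketch only produces the query text to paste into the search box. It assumes the operator wants the address without the leading http://, as in the example above.

```python
from urllib.parse import urlsplit


def cache_query(url: str) -> str:
    """Build a cache: query string from a page address (hypothetical helper)."""
    # Accept addresses with or without a scheme such as http://
    parts = urlsplit(url if "://" in url else "//" + url)
    # Keep host and path; the scheme is dropped
    target = parts.netloc + parts.path
    # Preserve a query string if the address has one
    if parts.query:
        target += "?" + parts.query
    return "cache:" + target


story = ("http://www.computerworld.com/securitytopics/security/"
         "virus/story/0,10801,78219,00.html")
print(cache_query(story))
print(cache_query("www.lisnews.com"))
```

The first call prints the same query shown above for the Computerworld story; the second gives the query for a bare host name.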

Oh, and while I'm on the topic, I've noticed some other oddities with Google's cache. Google has two rather distinct crawls: the regular GoogleBot crawl, sometimes called DeepBot, and a smaller one that focuses on frequently refreshed content, often called FreshBot. Results from FreshBot usually have a date listed before the "Cached" link. The two crawls can leave two separate cached copies of the same page at Google. For example, a search on lisnews today finds the top hit with a date of "Feb 10, 2003." Click on the "Cached" link, and the latest story on that copy is actually from Feb. 9. But a direct search for cache:www.lisnews.com pulls up a page cached Jan. 11. Both pages are searchable in Google's index. For hardcore cache users, the point is that there are two versions of the page accessible from Google, if you are willing to do a little digging.

Dated Feb 10, 2003 in Archived Pages | Google

