Search Engine Showdown
 

Archived Pages Category Archive

More Caches: Japan and China

So after learning yesterday that the Russian search engine Yandex (Яндекс) cached pages, I started looking at a few other well-known, non-English search engines. Baidu, the Chinese search engine, has just expanded into Japan with a Japanese Baidu. Both of these also have cached copies of pages. At Baidu, look for the 百度快照 links after the URL (similar to Google's placement). For the Japanese version, the cache link is キャッシュ in a similar location. For us non-Chinese and non-Japanese speakers, is there any use in these? Well, they are one more source for archived versions of pages, including English-language ones....

[full story] dated Jan 27, 2008 in Archived Pages

A Russian Cache

I do not usually spend much time with country-specific search engines, especially those in languages I do not speak. Even with English-language country-specific search engines, the general search engines usually have more comprehensive results and better search functionality. So when Phil posted about the Russian search engine Yandex (Яндекс), I just thought I'd take a quick look. Something piqued my curiosity, and I tried a few of the links. Sure enough, Yandex caches copies of many of the pages that it indexes. Look for the Сохраненная копия link at the bottom left of a search result record as in the...

[full story] dated Jan 26, 2008 in Archived Pages

HereUAre, Gigablast, 10 Billion, and Spam

Ever heard of HereUAre, which has "Over 10 billion pages indexed"? Try a search and you may recognize the results as coming from Gigablast. So what's the connection? This leads to a rather strange story of a vanished press release that I've been researching on and off for the past month or so. Here's the story. In trying to update my site a while back, I came across one page that linked to a June 19, 2006 press release from Gigablast about a database size increase to 10 billion and a new "report as spam" feature. The linked page (beta.gigablast.com/prnew.html) was...

New Exalead Interface Launches

Exalead has launched its new interface that has been in beta and preview mode for the past month or so. They also report indexing over 8 billion pages (they had initially stated they would meet that goal in the summer, so they are not too far behind schedule). Exalead Review has been updated....

[full story] dated Oct 10, 2006 in Archived Pages | Exalead

Why Search More than One?

Here is another example of a search I ran today where several search engines failed to give me the answer I needed. In particular, I was looking for a cached copy of a Web page, since the page was unavailable when I tried to view it. Three search engines failed to have any record of the page, but fortunately, the last one I tried had the page indexed and a cached copy available for me to view. The winner? Live Search. The losers? Ask, Yahoo!, and Google....

[full story] dated Sep 23, 2006 in Archived Pages | Overlap

Wikipedia as Source?

An interesting posting today first claimed that the U.S. Dept. of State shamelessly stole text from the Wikipedia:

"At this point some of you may ask just what the heck the US Dept. of State was doing, but let's take a moment to clear things up. First, it's obvious the Wikipedia page has been around for quite some time, and has evolved from that older state. . . . the US Dept. of State page doesn't even mention Wikipedia"

I find this posting fascinating in that some people assume that the Wikipedia is an old, established resource. Obviously, the author...

Yahoo! Adds Wayback Links to Cache

According to Gary, Yahoo! has expanded its cache option by providing links to old versions of Web pages via The Internet Archive's Wayback Machine. The link is in the header of the Yahoo! cached page copy. Gary notes that both Gigablast and Clusty offer links to the Wayback Machine as well....

[full story] dated Sep 18, 2005 in Archived Pages | Yahoo!

MSN Search Beta Release

Following up on the previous test technology preview release, MSN has launched its new, unique search engine database at beta.search.msn.com. As opposed to the previous tests, this version has advanced search features under the "Search Builder" link including site limits, link searches, selected country and region limits, 12 language limits, and three slider bars for adjusting the ranking. The beta version results page includes links to MSN's own cached pages, providing yet another source for cached copies. The cached versions' headers note the date the page was last indexed, but they do not highlight search terms. Nested Boolean searching is...

Cache Date Back in Google

Back in the depths of Google's history, their cached copy of Web pages included two dates: the date when Google crawled the page and the reported date stamp on the page at that time. Then, both dates disappeared as Google realized that they showed how old some parts of their database were. Now that they have greatly increased the freshness of their database and revisit more pages more frequently, they have finally added back some date information. The top line in the cache now gives the date Google last crawled the page. It is a welcome and useful addition....

[full story] dated Jul 29, 2004 in Archived Pages | Google

Google Adds Text Cache Version

Google has added a text-only cache version. After displaying a regular cached page, look in the header for a "Click here for the cached text only" link to see the cached page with just the text and without any images. This is discussed in more detail in a Search Engine Watch forum posting....

[full story] dated Jun 30, 2004 in Archived Pages | Google

New Search at Yahoo! (Drops Google)

As has long been expected, Yahoo! has announced the launch of its own search engine database and dropped Google. After using AltaVista, then Inktomi, and then Google to deliver search results after directory listings (and now that they own Inktomi, AltaVista, and AlltheWeb), Yahoo! now uses its own database. It appears to be primarily from Inktomi, but its results differ from MSN Search and HotBot, which also use Inktomi. Several positive comments at first look:

It still has cached copies of pages
It is a large database, sometimes finding more than Google
Most advanced search features still work

This launch...

PDFs on Gigablast

Matt Wells of Gigablast announces that "Gigablast now indexes PDF documents." To limit a search to PDF files, Gigablast uses a different command than the other search engines: Use type:pdf rather than the more standard 'filetype:'. To exclude PDF files, add type:text to a search. Matt also says that Gigablast "will support other file types in the future." Gigablast review updated. But remember, Gigablast defaults to OR, so a search like nutrition type:pdf is actually looking for any page with 'nutrition' OR any PDF file. The nutrition search finds zero results with both. To force it to work as expected,...

[full story] dated Aug 14, 2003 in Archived Pages | Gigablast
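
The PDF operators mentioned in the Gigablast post above can be sketched as a small query builder. This is only an illustration: the function names are made up here, and treating every engine other than Gigablast as accepting filetype:pdf is an assumption. It also does nothing about Gigablast's OR default.

```python
def limit_to_pdf(terms, engine):
    """Return the search string with a PDF-only limit for the given engine."""
    # From the post: Gigablast uses type:pdf, while the more standard
    # operator elsewhere is filetype:. Treating all other engines alike
    # is an assumption of this sketch.
    operator = "type:pdf" if engine.lower() == "gigablast" else "filetype:pdf"
    return f"{terms} {operator}"


def exclude_pdf_gigablast(terms):
    """Return a Gigablast search string that leaves PDF files out."""
    # From the post: adding type:text excludes PDF files on Gigablast.
    return f"{terms} type:text"


print(limit_to_pdf("nutrition", "Gigablast"))
print(limit_to_pdf("nutrition", "Google"))
print(exclude_pdf_gigablast("nutrition"))
```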

Indexing Robots.txt Files

It appears that Google's spider is not only checking robots.txt files, it is also indexing and even caching some of them. Try a search on allinurl:robots.txt to see some examples, or see the cached copy of the Salon.com file. It would be interesting to know why they are doing this. Other search engines, like AlltheWeb, will index robots.txt files that do not follow the protocol as in the search for disallow user-agent url.all:robots.txt. (The results either have the robots.txt file not located in the root directory or the filename is not all lower case.) But with Google not only indexing...
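
The two ways a robots.txt file can miss the protocol, as noted in the parenthetical above (not in the root directory, or a filename that is not all lower case), come down to one path check. A minimal sketch, with a hypothetical function name and example.com URLs used purely for illustration:

```python
from urllib.parse import urlsplit


def is_standard_robots_location(url):
    """True only when the file sits at the site root and is named in lower case."""
    # The path comparison is case-sensitive, so /Robots.txt and
    # /dir/robots.txt both fail the check.
    return urlsplit(url).path == "/robots.txt"


print(is_standard_robots_location("http://example.com/robots.txt"))
print(is_standard_robots_location("http://example.com/dir/robots.txt"))
print(is_standard_robots_location("http://example.com/Robots.TXT"))
```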

New Reference, Archives Pages

I've finally updated the listings on my Other Internet Search Tools page which covers searchable sources for articles, forums, email lists, blogs, etc. Two new pages linked from there are the Reference Search Tools covering just a few selected free online reference tools and the Archives page with sources for cached copies of Web pages and other ways to find old or dead pages....

Google News Loses Functionality

In Google News you used to be able to use advanced syntax like cache: followed by a URL to pull up a cached news story or site: to limit to a specific publication. Now this syntax no longer works and Google says "site:nytimes.com was dropped from your search because it is not supported for this type of search." For title searching, intitle: still works. Instead of site: try using source: which should be followed by either the single word for the source title that Google shows in green or for multiple-word sources, use an underscore (_) character in between...
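
The source: rule described above (a single word as is, or underscores between the words of a longer source title) might be sketched like this. The function name is hypothetical, and leaving the capitalization of the title untouched is an assumption, since the post does not say whether case matters.

```python
def news_source_limit(source_name):
    """Build a source: limit from the source title shown in the results."""
    # Split on whitespace and rejoin with underscores, so a single-word
    # title passes through unchanged. Case is left as typed (assumption).
    return "source:" + "_".join(source_name.split())


print(news_source_limit("Reuters"))
print(news_source_limit("New York Times"))
```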

Hunting for Google's Cache

Finding Google's cached copy is not always trouble-free. Take the recent example of an interesting story of journalistic confusion that gets even more confused. Apparently, a Computerworld reporter was fooled into believing that terrorists claimed responsibility for the recent "Slammer" worm. The original story was posted online but now states "Computerworld removed this story due to questions about its authenticity. An update about this situation has been posted." So what does this have to do with Google's cache? Well, other reporters thought they might find the original story from Google's cache. Google Village, in their story Google Everflux Misses Slammer...

[full story] dated Feb 10, 2003 in Archived Pages | Google

New! GigaBlast in Beta

GigaBlast launched in beta today. While much smaller than the recently launched Openfind, it offers some nice advantages. It includes cached copies of the pages it indexes, like Google. It includes an advanced search, date sorting, field searching, and excellent reporting of both the date spidered and the last modified date. It does lack full Boolean, truncation, and other advanced search features. See the Search Engine Showdown review for more on its search features....

Wayback Machine Debuts

The Internet Archive launches their Wayback Machine at web.archive.org. The Wayback Machine provides access to Web pages the way they used to look, and how they looked on particular dates when the Internet Archive visited them. This is like the Google cache on steroids, since it contains not only text but images from the past as well. And rather than just archiving one version of the page, it contains multiple versions. The entire archive is not keyword searchable. Instead, access is by URL and then by date....
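
Access "by URL and then by date" can be illustrated with a short address builder. The web.archive.org/web/<timestamp>/<url> pattern used here is not from the post; it is the address form the Wayback Machine accepts today, which leads to the capture nearest the requested date. The function name and the example.com address are placeholders.

```python
from datetime import date


def wayback_url(page_url, when):
    """Build a Wayback Machine address for a page near a given date."""
    # The timestamp is the date as YYYYMMDD; a time of day can follow
    # but is left out of this sketch.
    stamp = when.strftime("%Y%m%d")
    return f"https://web.archive.org/web/{stamp}/{page_url}"


print(wayback_url("http://example.com/", date(2008, 1, 27)))
```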
