Archived Pages Category Archive
Google Operating System reports in Bring Back Keyword Highlighting to Google Cache that logged in users have lost the ability to have search words highlighted when looking at Google's cached copy of a web page. "For some reason, Google's caching feature is more and more difficult to use. The "cached" link is hidden inside the Instant Preview box and it's no longer available in the mobile interface. Now the keywords from cached pages aren't highlighted if you are logged in."
I had not realized that cached links were gone from the mobile interface. While I don't always use the highlighting when looking at cached pages, when it is needed, one option is to re-do the search in a non-logged-in browser or to use the URL modification trick mentioned in the blog post. Since no keywords are highlighted, the header line of "These search terms are highlighted: . . . " is also missing. Also, as a reminder, the cache used to mention which search terms only showed up in links to the page. Now, even when you are not logged in, if a search word does not appear on the page itself, it is simply not mentioned.
I love cached copies of web pages for many reasons, but with the consolidation of search engines, the number of sources for a cached page has been decreasing. So imagine my dismay when I noticed the lack of links to a cached page copy at Bing recently. Fortunately, they are still there. The links are just a bit harder to find.
Read the rest of this post and watch the screencast to see how to find the new Bing cache link location:
So after learning yesterday that the Russian search engine Yandex (Яндекс) caches pages, I started looking at a few other well-known, non-English search engines. Baidu, the Chinese search engine, has just expanded into Japan with a Japanese Baidu. Both of these also have cached copies of pages. At Baidu, look for the
百度快照 ("Baidu snapshot") links after the URL (similar to Google's placement). For the Japanese version, the cache link is
キャッシュ ("cache") in a similar location.
For us non-Chinese and non-Japanese speakers, is there any use in these? Well, they are one more source for archived versions of pages, including English-language ones. For example, a search on library of congress (in English) finds hits at both. Here is a screenshot of the Chinese version with the cache link in gray at the end.
And here's the Japanese version, with the cached link again in gray at the end. I'll be adding both of these to my Finding Old Web Pages page.
I do not usually spend much time with country-specific search engines, especially those in languages I do not speak. Even with English-language country-specific search engines, the general search engines usually have more comprehensive results and better search functionality. So when Phil posted about the Russian search engine Yandex (Яндекс), I just thought I'd take a quick look. Something piqued my curiosity, and I tried a few of the links. Sure enough, Yandex caches copies of many of the pages that it indexes. Look for the
Сохраненная копия ("saved copy") link at the bottom left of a search result record, as in the screenshot below.
Yandex's cache does not include a date, at least that I could identify, but from a few tests, the cached page seems to range from quite recent (the day before) to several months old. I've added Yandex to my Finding Old Web Pages page.
Ever heard of HereUAre, which has "Over 10 billion pages indexed?" Try a search and you may recognize the results as coming from Gigablast. So what's the connection? This leads to a rather strange story of a vanished press release that I've been researching on and off for the past month or so. Here's the story.
In trying to update my site awhile back, I came across one page that linked to a June 19, 2006 press release from Gigablast about a database size increase to 10 billion and a new "report as spam" feature. The linked page (beta.gigablast.com/prnew.html) was no longer live. I did find a cached copy of the page, from Sept. 10, 2006, only at MSN Search. (No cached copies were available on Oct. 8 at Google, Yahoo!, Ask, or the Wayback Machine.) Fortunately, when I came across it, I FURLed the MSN Search cached copy of the page. I'm glad I did: in checking today, I could not find a cached copy or link at Live or any of the other main search engines. Since FURL saves a copy of the page, I still have the text from the press release.
To summarize the release, Gigablast now has a database with over 10 billion pages, and this is where it calls it the "HereUAre search engine." It also mentions a beta (no longer available), "multi-language support, real-time indexing, and improved spam control." One part of the spam control is that at the end of each search result, Gigablast now has a link labeled "[report as spam]." Click that link to report an entry as spam. The Gigablast site does not have the 10 billion claim on it, although it does continue to have the [report as spam] links. The HereUAre site does have the 10 billion claim and the spam reporting. It also makes it sound as if the search technology is its own, with no mention of Gigablast. I was also surprised that I found no mention of HereUAre, the Gigablast 10 billion, or the spam report at other search engine news sites. So, I'm posting what I've found out, and in the interest of sharing information, here is a copy of MSN's cached copy of the press release.
Exalead has launched its new interface that has been in beta and preview mode for the past month or so. They also report indexing over 8 billion pages (they had initially stated they would meet that goal in the summer, so they are not too far behind schedule). Exalead Review has been updated.
Here is another example of a search I ran today where several search engines failed to give me the answer I needed. In particular, I was looking for a cached copy of a Web page, since the page was unavailable when I tried to view it. Three search engines failed to have any record of the page, but fortunately, the last one I tried had the page indexed and a cached copy available for me to view. The winner? Live Search. The losers? Ask, Yahoo!, and Google.
An interesting posting today first claimed that the U.S. Dept. of State shamelessly stole text from the Wikipedia:
At this point some of you may ask just what the heck the US Dept. of State was doing, but let's take a moment to clear things up. First, it's obvious the Wikipedia page has been around for quite some time, and has evolved from that older state. . . . the US Dept. of State page doesn't even mention Wikipedia

I find this posting fascinating in that some people assume that the Wikipedia is an old, established resource. Obviously, the author did not know that the State Department has been producing Background Notes for decades. Certainly, most librarians reading this will guess correctly that the Wikipedia grabbed the text from the State Department originally, and not vice versa.
The page also (somewhat) demonstrates how the social, self-correcting nature of the Web can sometimes fix such mistakes. After its initial posting, the author added an update at the end and a "Read the update at the bottom, old article preserved for amusement potential only!" at the top. The update does note that
Some people did some great digging and found a copy of the original US Dept. of State document. And guess what? It just barely predates the Wikipedia page.

But "just barely" still hints at the lack of understanding of the preceding print versions of the Background Notes.
Anyway, there is also an interesting Web search connection here. I first came across the page after the update had been added. I wanted to see an earlier version, but based on internal content, I could guess that the original had just been posted earlier today. In trying to find the older version, I knew it was too recent for the Wayback Machine. Instead, I checked for cached copies at the search engines. Yahoo! indeed had indexed it, and their cached copy was the earlier version. Out of curiosity I checked at Google, MSN, Ask, and Gigablast. None of the other search engines had yet indexed the page. Once again, the answer to my question was found by one search engine, and in this case, not by Google.
According to Gary, Yahoo! has expanded its cache option by providing links to old versions of Web pages via The Internet Archive's Wayback Machine. The link is in the header of the Yahoo! cached page copy. Gary notes that both Gigablast and Clusty offer links to the Wayback Machine as well.
Following up on the previous test technology preview release, MSN has launched its new, unique search engine database at beta.search.msn.com. As opposed to the previous tests, this version has advanced search features under the "Search Builder" link including site limits, link searches, selected country and region limits, 12 language limits, and three slider bars for adjusting the ranking. The beta version results page includes links to MSN's own cached pages, providing yet another source for cached copies. The cached versions' headers note the date the page was last indexed, but they do not highlight search terms. Nested Boolean searching is supported with the +, -, and | symbols as well as with AND, OR, NOT operators which must be in uppercase. Phrase searching with "double quotes" is available as are the site:, link:, language:, url:, and location: command line options. With this launch, MSN claimed a database size of 5 billion pages, prompting Google to increase its count from 4.2 to 8 billion pages just hours before MSN launched this beta. In conjunction with the beta launch, MSN has also launched an MSN Search blog.
Back in the depths of Google's history, their cached copy of Web pages included two dates: the date when Google crawled the page and the reported date stamp on the page at that time. Then both dates disappeared, as Google realized that they showed how old some parts of their database were. Now that they have greatly increased the freshness of their database and revisit more pages more frequently, they have finally added back some date information. The top line in the cache now gives the date Google last crawled the page. It is a welcome and useful addition.
Google has added a text only cache version. After displaying a regular cached page, look in the header for a "Click here for the cached text only" link to see the cached page with just the text and without any images. This is discussed in more detail in a Search Engine Watch forum posting.
As has long been expected, Yahoo! has announced the launch of its own search engine database and dropped Google. After using AltaVista, then Inktomi, and then Google to deliver search results after directory listings (and now that they own Inktomi, AltaVista, and AlltheWeb), Yahoo! now uses its own database. It appears to be primarily from Inktomi, but its results differ from MSN Search and HotBot which also use Inktomi. Several positive comments at first look:
- It still has cached copies of pages
- It is a large database, sometimes finding more than Google
- Most advanced search features still work
Matt Wells of Gigablast announces that "Gigablast now indexes PDF documents." To limit a search to PDF files, Gigablast uses a different command than the other search engines:
type:pdf rather than the more standard 'filetype:'.
To exclude PDF files, add
type:text to a search. Matt also says that Gigablast "will support other file types in the future." Gigablast review updated.
But remember, Gigablast defaults to OR, so a search like
nutrition type:pdf is actually looking for any page with 'nutrition' OR any PDF file. Run that way, the nutrition search finds no results matching both terms. To force it to work as expected, remember to add the + symbol, as in
+nutrition type:pdf.
The search results display gives a big PDF logo in front of all the PDF files, but most do not include extracts. That makes it hard to determine what a file is about, since many PDF file names are not very helpful. On the plus side, Gigablast is the only search engine other than Google that includes an HTML version of the PDF. Click on the [cached] link after any PDF to see the HTML version used for indexing.
It is great to see this included on Gigablast, especially for the cache availability. But in several quick searches, most of the PDFs found were fairly short ones, and I found few from .gov sites. So the underlying database needs to expand, but this is a great start.
It appears that Google's spider is not only checking robots.txt files, it is also indexing and even caching some of them. Try a search on
allinurl:robots.txt to see some examples, or see the cached copy of the Salon.com file.
It would be interesting to know why they are doing this. Other search engines, like AlltheWeb, will index robots.txt files that do not follow the protocol, as in the search for
disallow user-agent url.all:robots.txt. (The results either have the robots.txt file not located in the root directory or the filename is not all lower case.) But with Google not only indexing the content of the files but also saving cached versions, this opens up some interesting applications: searching for sites that exclude specific bots, and tracking changes in a robots.txt file for a specific site by comparing the cached version to the current version.
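The change-tracking idea can be sketched in a few lines of Python. This is only an illustration of comparing a saved (cached) robots.txt against the live one; the function names and the fetch helper are my own, not anything the search engines provide.

```python
import difflib
import urllib.request

def diff_robots_texts(cached_text, current_text):
    """Return a unified diff between a cached robots.txt and a newer copy."""
    return list(difflib.unified_diff(
        cached_text.splitlines(), current_text.splitlines(),
        fromfile="cached", tofile="current", lineterm=""))

def fetch_robots(host):
    """Fetch the current robots.txt for a bare hostname (illustrative helper)."""
    with urllib.request.urlopen("http://%s/robots.txt" % host) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Feed it the text saved from a search engine's cached copy and the text fetched from the live site, and any newly excluded bots or paths show up as +/- lines in the diff.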
How long this may remain available will depend on whether this was intentional on Google's part or simply a mistake. Since some of the KWIC extracts (snippets) show code that is not actually in the original files, I suspect that it may be either a mistake or that it still has some bugs that need to be worked out.
I've finally updated the listings on my Other Internet Search Tools page which covers searchable sources for articles, forums, email lists, blogs, etc. Two new pages linked from there are the Reference Search Tools covering just a few selected free online reference tools and the Archives page with sources for cached copies of Web pages and other ways to find old or dead pages.
In Google News you used to be able to use advanced syntax like cache: followed by a URL to pull up a cached news story, or site: to limit to a specific publication. Now these operators no longer work, and Google says "site:nytimes.com was dropped from your search because it is not supported for this type of search." For title searching, intitle: still works. Instead of site:, try using source:, which should be followed by either the single word for the source title that Google shows in green or, for multiple-word sources, the words joined with an underscore (_) character, as in
source:new_york_times. Google News could really use an advanced search form and the restoration of the cached copies.
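The underscore convention is mechanical enough to automate. Here is a tiny, hypothetical helper (my own illustration, not a Google API) that turns a publication name into the source: operator form described above:

```python
def to_source_operator(name):
    """Build a Google News source: operator from a publication name.

    Multi-word names are lowercased and joined with underscores,
    per the convention described in the post.
    """
    return "source:" + "_".join(name.lower().split())
```

For example, to_source_operator("New York Times") yields source:new_york_times, ready to paste into a Google News query.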
Finding Google's cached copy is not always trouble free. Take a recent example, in which an interesting story of journalistic confusion gets even more confused. Apparently, a Computerworld reporter was fooled into believing that terrorists claimed responsibility for the recent "Slammer" worm. The original story was posted online but now states "Computerworld removed this story due to questions about its authenticity. An update about this situation has been posted."
So what does this have to do with Google's cache? Well, other reporters thought they might find the original story from Google's cache. Google Village, in their story Google Everflux Misses Slammer Terror states that "Google is good at getting the fresh stuff each day, but not good enough to capture a page, and cache it after such a page has appeared for a few hours." And The Register reports that the story "doesn't seem to have been around long enough to make it into Google cache."
Well, I beg to differ. For as long as it lasts, take a look here. Presumably, the reporters tried a search like cache:www.computerworld.com/securitytopics/security/virus/story/0,10801,78219,00.html which currently gives no results. If they had gone one step further and clicked on the "News" tab, they would have found the cached file. Note that the cached copy is missing the usual surrounding text and graphics. I think this is due to the way Google identifies news articles for indexing, leaving out the navigational and other surrounding text. Google News search results do not display a link to a cached copy of the story, but apparently they are there anyway. And in case the cached copy disappears from Google, I have a copy on my site.
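For the curious, the cache: lookup the reporters presumably tried amounts to building a Google query URL. This sketch assumes Google's standard /search?q= query form of the era; as the story shows, whether a cached copy actually turns up (and on which tab) is another matter.

```python
from urllib.parse import quote

def google_cache_query(page_url):
    """Build a Google web-search URL for a cache: lookup of page_url."""
    return "http://www.google.com/search?q=" + quote("cache:" + page_url, safe="")
```

Pasting the returned URL into a browser runs the same cache: search described above.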
Oh, and while I'm on the topic, I've noticed some other oddities with Google's cache. Google has two rather distinct crawls: the regular GoogleBot crawl, sometimes called DeepBot, and a smaller one that focuses on frequently refreshed content, often called FreshBot. Results from FreshBot usually have a date listed before the "Cached" link. These two crawls can have two separate cached copies at Google. For example, a search on lisnews today finds the top hit with a date of "Feb 10, 2003." Click on the "cached" link, and the latest story is actually from Feb. 9. But a direct search for cache:www.lisnews.com pulls up a page cached Jan. 11. Both pages are searchable in Google's index. But for hardcore cache users, the point is that there are two versions of the page accessible from Google, if you are willing to do a little digging.
GigaBlast launched in beta today. While much smaller than the recently launched Openfind, it offers some nice advantages. It includes cached copies of the pages it indexes, like Google. It includes an advanced search, date sorting, field searching, and excellent reporting of both the date spidered and the last modified date. It does lack full Boolean, truncation, and other advanced search features. See the Search Engine Showdown review for more on its search features.
The Internet Archive launches their Wayback Machine at web.archive.org. The Wayback Machine provides access to Web pages the way they used to look, as of the particular dates when the Internet Archive visited them. This is like the Google cache on steroids, since it contains not only text but images from the past as well. And rather than just archiving one version of a page, it contains multiple versions. The entire archive is not keyword searchable. Instead, access is by URL and then by date.