July 2003 Archive
Finally, another search engine offers searchable access to about as many file types beyond PDFs as Google does. AlltheWeb has expanded from indexing just PDF, Microsoft Word, and Flash files (beyond the HTML of most Web pages) to include Rich Text Format (rtf), PowerPoint (ppt), Excel spreadsheets (xls), PostScript (ps), and even WordPerfect (wpd) and StarOffice (sdd and sdc) files. There may be more besides these. To limit a search to one of these new file types, use the
filetype: command followed by the name of the file format. Other search engines use the file extension instead, so note the difference in how the command works at AlltheWeb.
Here are the new filetypes that have worked for me:
HotBot has changed three of the four names for the search engines it searches on its site.
Inktomi is now just "HotBot"
FAST is now called "Lycos"
Teoma now shows as "Ask Jeeves"
Each of the four now offers spelling suggestions as well. And related searches can be displayed at the top if you turn them on in the preferences section under "Results Prefs."
Google has finally added an advanced search page for its news database. It includes options for sorting by date, specifying the news source, a location limit, a date limit, and field searches for headline, body, and URL.
A few minor updates at Gigablast announced today include better keyword highlighting. The default of an OR on multi-word searches remains, unlike at all the major search engines, and it hearkens back to the search engines of the late 1990s. However, they now put a teal bar at the top of search engine results pages where the default OR was used, which links to an explanation and states
"The results below may not have all your query terms, but may be relevant. Try generalizing your query. [Info]"
Sorry, but I think Gigablast just needs to default to AND like most people expect and the major search engines all do. I find the default OR frustrating enough that I will skip a try at Gigablast just for that reason sometimes. Also, Gigablast announced that "When returning a page of search results Gigablast lets you know how long ago that page was cached by displaying a small message at the bottom of that page." However, you only see that if someone else has done that same search recently. These are different from the dates at the top of the cached page, and Gigablast still does a far better job than any other search engine at honestly stating when they crawled a Web page and the date reported at that time.
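To make the difference between a default OR and a default AND concrete, here is a minimal sketch in Python. It is only a toy model: the three sample documents and the function names are made up for illustration, and it says nothing about how Gigablast or any other engine actually matches or ranks pages.

```python
# Toy illustration of default-OR versus default-AND matching.
# The documents and helper names here are hypothetical.

docs = {
    "a": "search engine news and reviews",
    "b": "engine repair manual",
    "c": "gardening tips",
}


def matches(query_terms, document, mode="AND"):
    """Return True if the document satisfies the query under the given operator."""
    words = set(document.lower().split())
    hits = [term.lower() in words for term in query_terms]
    # AND requires every term to be present; OR requires at least one.
    return all(hits) if mode == "AND" else any(hits)


def search(query, mode):
    """Return the ids of all documents matching the query, in sorted order."""
    terms = query.split()
    return sorted(doc_id for doc_id, text in docs.items() if matches(terms, text, mode))


and_results = search("search engine", "AND")
or_results = search("search engine", "OR")
print(and_results)  # ['a']
print(or_results)   # ['a', 'b']
```

With a default OR, the repair manual comes back for a "search engine" query even though it contains only one of the two words, which is the kind of result the teal bar is warning about.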
Yahoo! announces today that they are acquiring Overture, known for its highly profitable ads, ranked by the highest bidder. And Overture earlier this year bought up AltaVista and AlltheWeb. At a price of approximately $1.63 billion in cash and stock, Yahoo! expects to close the deal by the fourth quarter of 2003.
So Yahoo! will own Inktomi, AltaVista, and AlltheWeb (the FAST Web Search property), three of the major Web search engines. Yet currently Yahoo! still uses Google for the majority of its search results. That should be changing sometime soon, but whether they will combine the three or use only one, and what will happen with the AltaVista and AlltheWeb search sites and their advanced capabilities and syntax, no one is saying.
And who's left outside of Google and the Yahoo! group with their own custom-built databases? Ask Jeeves' Teoma, LookSmart's struggling WiseNut, and the newcomer (from last summer) Gigablast. Well, the consolidation predicted to happen about five years ago is finally occurring. Let's hope that search will still continue to improve, expand, and offer even more options and resources.
For more than a month now, Google's intitle: and inurl: field searches have been broken. I first heard of this on May 27, 2003. The advantage of intitle: and inurl: over the advanced search page Occurrences section or the allintitle: and allinurl: field searches was that they applied to only a single term and could be combined with other search terms that would look through the whole record. So now, searchers cannot do a search that looks for one word in the title and another in the body. A search such as "market research" intitle:tourism retrieves many results that do not include 'tourism' in the title.
At first I thought this was a temporary glitch from the strange May update, but it has persisted through the June update and has continued for some time. Hopefully it will be corrected sometime soon. I've updated the Google Inconsistencies page with this problem and several other long-term problems.
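As a rough sketch of the intended behavior, the Python below models the difference between a single-term title restriction combined with a phrase found anywhere in the record, and an all-terms-in-title restriction. The Page class, the two sample pages, and the function names are all hypothetical; this is not how Google implements these operators.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """A hypothetical indexed record with a title and body text."""
    title: str
    body: str


def tokens(text):
    """Lowercased set of whitespace-separated words."""
    return set(text.lower().split())


def match_intitle(page, title_term, other_phrase):
    """One term must be in the title; the phrase may appear anywhere in the record."""
    anywhere = (page.title + " " + page.body).lower()
    return title_term.lower() in tokens(page.title) and other_phrase.lower() in anywhere


def match_allintitle(page, terms):
    """Every term must appear in the title."""
    title_words = tokens(page.title)
    return all(term.lower() in title_words for term in terms)


pages = [
    Page("Tourism statistics 2003", "Annual market research on visitor numbers"),
    Page("Market research firms", "Reports covering tourism and retail"),
]

intitle_hits = [p.title for p in pages if match_intitle(p, "tourism", "market research")]
allintitle_hits = [p.title for p in pages
                   if match_allintitle(p, ["market", "research", "tourism"])]
print(intitle_hits)     # ['Tourism statistics 2003']
print(allintitle_hits)  # []
```

In this toy model, the second page is the kind of result the broken search returns: it has 'tourism' in the body but not in the title.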
In addition, I updated several parts of the Google Review, including the addition of several language limits added in early 2002 that I had missed: Croatian, Indonesian, Serbian, Slovak, and Slovenian.
It appears that Google's spider is not only checking robots.txt files, it is also indexing and even caching some of them. Try a search on
allinurl:robots.txt to see some examples, or see the cached copy of the Salon.com file.
It would be interesting to know why they are doing this. Other search engines, like AlltheWeb, will index robots.txt files that do not follow the protocol, as in the search for
disallow user-agent url.all:robots.txt. (The results either have the robots.txt file not located in the root directory or the filename is not all lower case.) But with Google not only indexing the content of the files but also saving cached versions, this opens up some interesting applications for searching for sites that exclude specific bots and also to track changes in a robots.txt file for a specific site by comparing the cached version to the current version.
How long this may remain available will depend on whether this was intentional on Google's part or simply a mistake. Since some of the KWIC extracts (snippets) show some code such as
that are not actually in the original files, I suspect that it may be either a mistake or that it just still has some bugs that need to be worked out.
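The two applications mentioned above, finding sites that exclude a specific bot and tracking changes between a cached and a current robots.txt file, can be sketched with Python's standard library. This is only an illustration under stated assumptions: the two robots.txt texts are invented, "ExampleBot" is a hypothetical crawler name, and retrieving the cached copy from a search engine is left out entirely.

```python
import difflib
from urllib.robotparser import RobotFileParser


def blocks_bot(robots_txt, user_agent, path="/"):
    """Return True if the given robots.txt text disallows the path for this bot."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, path)


def robots_diff(cached, current):
    """Return unified-diff lines showing what changed between the two versions."""
    return list(difflib.unified_diff(
        cached.splitlines(), current.splitlines(),
        fromfile="cached", tofile="current", lineterm=""))


# Hypothetical file contents standing in for a cached copy and the live file.
cached_copy = "User-agent: *\nDisallow: /private/\n"
current_copy = (
    "User-agent: *\nDisallow: /private/\n"
    "\n"
    "User-agent: ExampleBot\nDisallow: /\n"
)

blocked_before = blocks_bot(cached_copy, "ExampleBot")
blocked_now = blocks_bot(current_copy, "ExampleBot")
changes = robots_diff(cached_copy, current_copy)

print(blocked_before, blocked_now)  # False True
print("\n".join(changes))
```

Lines in the diff that start with a plus sign were added since the cached copy was taken, so a newly excluded bot shows up right away.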
According to today's press release, LookSmart results will begin appearing on Lycos sometime later this summer. It sounds like Lycos will give ads from Overture, LookSmart ads, and then other results from the FAST database.
Ask Jeeves has completed the sale of its enterprise search business to Kanisa, a deal first announced on May 28. In conjunction with the original announcement, the company stated that "Ask Jeeves will now focus on its core competency of Web-wide search."