November 2002 Archive
Gary Price reports that "You're now able to limit your search to a specific site for stories available via Google News. In other words, the site: syntax now works." He includes several examples. Maybe they will eventually add an advanced search page as well.
Well, it was up for a while, but it has been inaccessible today, so the previous announcement may have been premature.
Daypop is back up and running at last. If you've been waiting a while to try it out, take a look now. Dan Chan appears to have got it back up late Thursday night, although it may be a bit less dependable than before since Dan posted that "After being spoiled with a big block of IPs for my previous business class line, I had to figure out how to run the servers off residential DSL at my new home in the Bay Area."
Inktomi has released its Web Search 9. Whether or not, and how soon, its partners will implement some of the new features remains to be seen. Key points of the new launch include:
- A new 3 billion record database
- PDFs and other file types
- Spelling suggestions, for English only, but for names as well as dictionary terms
- An even fresher database, claiming to re-index its entire database every 10-14 days and paid inclusion URLs every 48 hours
- Smart summaries where Inktomi will either display a contextual summary (KWIC display), an editorial summary, or an advertiser-supplied summary for paid-inclusion customers
- And more for Index Connect partners
Beware of diacritics in search terms. The various search engines handle them in different ways. Take a diacritic like e with an acute accent, 'é'. Will an 'e' match on both the 'e' as well as the e acute? At Google, searches will only be an exact match. At AltaVista and other search engines, the plain 'e' will match on both (and other 'e' diacritics). For example, a Google search for "epistolaires de mari" found only one hit while "épistolaires de mari" found more than a dozen. At AltaVista and other search engines, "epistolaires de mari" finds all the diacritic permutations. I hope to do a more in-depth diacritic showdown in the future, but for now, the lesson is to search both with and without the diacritics at Google for the most comprehensive search.
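To make the two behaviors concrete, here is a minimal Python sketch. It is only an illustration of exact matching versus accent-insensitive matching and says nothing about how Google or AltaVista actually implement their indexes. The function names and the sample page text are hypothetical.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Remove combining marks so that 'é' is compared as plain 'e'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def exact_match(query: str, page: str) -> bool:
    """Accent-sensitive matching: 'e' and 'é' are different characters."""
    return query.lower() in page.lower()


def folded_match(query: str, page: str) -> bool:
    """Accent-insensitive matching: both sides are folded before comparing."""
    return fold_diacritics(query).lower() in fold_diacritics(page).lower()


# Hypothetical page text containing the accented spelling.
page = "Recueil de lettres épistolaires de mari et femme"

plain_query = "epistolaires de mari"
accented_query = "épistolaires de mari"

print(exact_match(plain_query, page))     # plain 'e' misses the accented page
print(exact_match(accented_query, page))  # exact accented spelling matches
print(folded_match(plain_query, page))    # folding lets plain 'e' match
```

With exact matching, the searcher has to run both spellings to catch every page, which is the workaround suggested above for Google.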
Inktomi announced the sale of its enterprise search software (formerly known as Ultraseek) to Verity. They also announced a new focus on Web searching, which is now the major business that Inktomi is still involved in. Note that they say "The company plans to announce significant enhancements to Inktomi Web Search next week that will further increase relevance and freshness." Despite losing several partners this year, this move shows their confidence in being able to make money in the Web search business. Stay tuned for the announcements next week. Inktomi review updated. See also the Verity press release and their conference call. The acquisition does include both the Ultraseek enterprise search engine originally developed at Infoseek and the recently acquired Quiver automatic classification software.
Teoma can now use an OR and is supposed to do phrase searching correctly. Previously the phrase searching was only approximate. The OR must be in all uppercase letters, and since nesting is not supported, a simple query such as
x y OR z gets treated as (x AND y) OR z, which is not what most people would expect. Also, the spell check is now in beta for common English words only (not names), and they are asking for feedback on it. I have updated my Teoma review (and my features chart and the list by feature page) to reflect the changes.
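The difference between the two readings of x y OR z is easiest to see on a tiny example. The Python sketch below treats each document as a set of terms and compares the grouping described above, (x AND y) OR z, with the grouping many searchers would expect, x AND (y OR z). The documents and function names are made up for illustration, and this is not Teoma's code.

```python
def observed_grouping(doc: set, x: str, y: str, z: str) -> bool:
    """(x AND y) OR z: the grouping described for Teoma above."""
    return (x in doc and y in doc) or (z in doc)


def expected_grouping(doc: set, x: str, y: str, z: str) -> bool:
    """x AND (y OR z): what many searchers would expect instead."""
    return x in doc and (y in doc or z in doc)


# Hypothetical documents, each reduced to the set of terms it contains.
docs = {
    "doc1": {"x", "y"},
    "doc2": {"x", "z"},
    "doc3": {"z"},
}

observed = sorted(name for name, terms in docs.items()
                  if observed_grouping(terms, "x", "y", "z"))
expected = sorted(name for name, terms in docs.items()
                  if expected_grouping(terms, "x", "y", "z"))

print(observed)  # includes doc3, which contains only z
print(expected)  # excludes doc3, since x is required
```

The document containing only z is retrieved under the first grouping but not the second, so a query meant to narrow by x can pull in results that never mention it.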
AltaVista, one of the few surviving old-time search engines, is trying another relaunch. Originally scheduled for Nov. 12, the new site has been up since this weekend with a mixture of nice improvements and some failures. See their quick tour for their PR push.
The advanced search has a File Type limit and filetype:pdf also works. This is a substantial quality increase for their database. However, like at Google, AltaVista only indexes the beginning part of the PDF files. For example, one 228 page PDF is only indexed up to about page 120. Only FAST (at AlltheWeb and Lycos) is indexing full PDF files.
They have certainly improved the freshness of their database as a whole. In my freshness comparison of Oct. 20, the bulk of their database was about 3 months old. Now it looks more like it is from mid September with a few pages from the last three days. That is a big improvement. However, their claim of refreshing "50% of the results daily" is a bit misleading. They plan on revisiting half of the results that users retrieve and refreshing those. That should mean that about half of the results that most users see will be fresh, but it is not half the whole database. The fresh results will be marked with "Refreshed in the past 24 hours" or "Refreshed in the past 48 hours" which is a more accurate label than Google's date (since it only represents the date the page was last checked and not necessarily the date when the content last changed).
After crawling about 4 billion URLs, their production database is about 900 million. (Roughly 20 million of these are supposed to be refreshed daily). It is nice to see a larger database, but they still miss many pages available from Google and AlltheWeb. AltaVista also now has 400 million Web objects (images, audio files, and video files). The image database is supposed to be increasing from about 100 million to 250+ million images.
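A quick back-of-the-envelope check, using only the rounded figures quoted above, shows how the daily refresh compares with the size of the whole database. The variable names are mine, and the last figure assumes that no page is refreshed twice before every page has been refreshed once, which is almost certainly not how the crawler behaves.

```python
crawled_urls = 4_000_000_000      # "about 4 billion URLs" crawled
production_db = 900_000_000       # "about 900 million" in the production database
refreshed_daily = 20_000_000      # "roughly 20 million" refreshed per day

# Share of the whole database refreshed on a given day.
daily_share = refreshed_daily / production_db

# Share of crawled URLs that end up in the production database.
kept_share = production_db / crawled_urls

# Days for one full pass at this rate, assuming no page is refreshed twice.
days_for_full_pass = production_db / refreshed_daily

print(f"{daily_share:.1%} of the database refreshed per day")
print(f"{kept_share:.1%} of crawled URLs kept")
print(f"{days_for_full_pass:.0f} days for one full pass")
```

On these numbers the daily refresh covers only a few percent of the database, which is why the "50% of the results" claim should be read as applying to results users actually see.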
Like at Google, AltaVista says that over 50% of its traffic is from non-North American users, so they are focusing on expanding and improving how the site serves all their users. US users see a US version while German users have a version specialized for them. If you get a version you do not want, click on the AltaVista [country] link in the upper right hand corner to change the default. For the US version, it defaults to searching English and Spanish language pages only. Be sure to change that if you want a broader search, and AltaVista will remember the change. The US regional limit defaults to Worldwide, but there is a US limit available. Other countries default to their own country limit, which is based on both top level domains and link analysis to determine the geographic orientation of particular pages. The region limit on the advanced search page is now gone.
Their Prisma technology for suggesting related and narrower searches has been widened from English to include French, German, Italian, and Spanish as well. Their News search has also been expanded from English to include German language sources for their German version.
Power Search Back as More Precision
With all the turnover at AltaVista, it seems no one there remembers their old Power Search. Now they have brought back some semblance of it under the label of "More Precision." All it offers is All, Any, Phrase, and None choices for those who don't do Boolean searches. There are also the default region and language limits, but the Advanced Search still offers more options.
Following up on their August removal of the annoying pop-up and pop-under ads, now their home page has no banner ads either.
A new expansion on their Shortcuts is called Shortcut Answers. The Shortcuts, an effort to get at material on the invisible Web, are marked with a small boxed arrow and appear just above the regular results. The new expansion tries to provide answers rather than just links. So a search on exchange rate zloty or area code dallas gives answers directly on the results list. For more details, see their Shortcuts Help page.
Some early problems with the new launch are getting fixed fairly quickly. For example, the directory search would give directory category matches, but the links were broken. That is now fixed. However, it would be even better if AltaVista gave the full category label at the top of the category results when clicking on the category links from a search result. Also, some other parts of the site did not seem to work right but are back to normal now.
Internet Explorer Optimization
Some features only work or work best in Internet Explorer. The vertical blue bar down the left side of the page can be clicked to open the search result in a new window. It is a nice touch, but only in IE does the highlight in the blue bar show up when you mouse over the record. Clicking on the title of a record shows an "opening page..." note in green, and after returning to the results page a "last page visited" note in green will show up next to the record just viewed. Nice features, but both work only in IE. And then there is the search button, renamed "Find" for some unknown reason. The button color only shows up in IE. In Mozilla, the 'Find' text is white on grey. At least it is black text in Netscape 4.7 and Opera.
New News Search
The AltaVista news search, with content from Moreover, New York Times, BBC, CNN, Forbes, and others, is expanded and now provides limits for Regions, Sources, and Date, in addition to the Topic limit available earlier. It has also added the Prisma suggested searches technology and will add news pictures as well. Most significant to me is that some of these news sources have stories indexed from a year or more ago, well beyond the month at Google News.
AltaVista says that internal tests show a 40% user satisfaction improvement over the past few months. I've seen mixed results, but also note Danny Sullivan's article from last week, Paid Inclusion Listings May Get Boosted At AltaVista. Let's hope that this is only a temporary glitch.
The main Google page claim has jumped from 2,469,940,685 web pages to 3,083,324,652 web pages. To get their number over 4 billion, they add in the 330 million images in their image database and the "nearly 800 million Usenet newsgroup postings" in Google Groups. The image number has remained static since Dec. 2001, but the Usenet postings have grown from 700 million then to "nearly" 800 million.
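The arithmetic behind the "over 4 billion" figure can be checked directly from the numbers quoted above. In this small Python sketch the Usenet count is rounded up to a flat 800 million, which is my simplification since Google only says "nearly 800 million."

```python
old_web_pages = 2_469_940_685    # previous claim on the main Google page
new_web_pages = 3_083_324_652    # current claim on the main Google page
images = 330_000_000             # image database
usenet_posts = 800_000_000       # "nearly 800 million", rounded up here

# Combined total across Web pages, images, and Usenet postings.
combined_total = new_web_pages + images + usenet_posts

# Growth in the claimed Web page count alone.
web_growth = new_web_pages - old_web_pages
web_growth_ratio = web_growth / old_web_pages

print(f"Combined total: {combined_total:,}")
print(f"Claimed Web page growth: {web_growth:,} ({web_growth_ratio:.1%})")
```

The combined total clears 4 billion only because the images and Usenet postings are included, and the claimed Web page count itself grew by roughly a quarter.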
So what about their basic Web page growth? I am not sure what they are counting. On a few quick tests that I ran, Google did not seem to find that many more results than they did last March, and in some cases, they actually found fewer. It may be that the unindexed URLs and duplicates have increased substantially, but I have been unsuccessful in getting Google to comment on that. According to Google's Nate Tyler, "more than 40 percent of these 3 billion web pages are authored in non-English languages" and "more than 50 percent of Google's traffic comes from overseas," so perhaps much of the claimed growth comes in that sector.
The Wayback Machine has announced the official launch of its "document compare" feature which uses DocuComp technology to compare two historical Web pages and highlight the differences. Look for the "Compare Archive Pages" in tiny print in the upper right hand corner after the search box on a search results page to try out this feature. See their FAQ for more information.
New search engine Gigablast is starting to expand. Yesterday they launched a Swedish/Scandinavian version at gigablast.nu. It looks like the same database with the addition of a Swedish language limit. The design of the site is much nicer than gigablast.com but the advanced search does not have all the same options. They are accepting advertising which may well help to support continued development of Gigablast. More information is available on their About page.
Overture has stopped displaying the bid price on their results page. Formerly, if you ran a search at overture.com, you could see who had bid on the term and how much. That information is still available, but it is now only viewable from the "View Advertisers' Max Bids" link in the upper right hand corner. You have to re-enter your search and then add the security code displayed in a graphic. It is somewhat like AltaVista's submission code on their basic add URL service and probably is designed to prevent automated bid checking.
Danny Sullivan and Chris Sherman have published the results of their search engine torture test, or as they now call it, the "Perfect Page Test." For even more details, see their Criteria and Detailed Results page. It is a fascinating analysis, but bear in mind that the test is only measuring results for one kind of search -- where there is one perfect page for an answer. Many searches do not have such an answer and the relevance of the results from any search engine could be quite different on those searches.
Teoma has updated their database, expanded it by 60% (to about 350 million records after crawling 750 million), and added several new search features. Eventually there should also be an advanced search page and spelling suggestions. What's new?
- Site collapsing: first two hits per domain are listed with others under a "More results from . . . " message (even if there are only two).
- KWIC (keyword-in-context) display
- Stop words are searched within phrases
- New field searches
- Language limits
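The site collapsing item in the list above can be sketched in a few lines of Python. This is only an illustration of the general idea of showing the first two hits per host and tucking the rest under a "More results from" link. It is not Teoma's implementation, it groups by the URL's host rather than by registered domain, and the function name and example URLs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse


def collapse_by_site(ranked_urls, per_site=2):
    """Keep the first `per_site` hits per host; group the rest by host."""
    shown = []
    more_results = defaultdict(list)
    seen_per_host = defaultdict(int)
    for url in ranked_urls:
        host = urlparse(url).netloc
        if seen_per_host[host] < per_site:
            shown.append(url)            # still within the per-site limit
        else:
            more_results[host].append(url)  # goes under "More results from"
        seen_per_host[host] += 1
    return shown, dict(more_results)


# Hypothetical ranked result list.
results = [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.org/1",
    "http://example.com/c",
    "http://example.com/d",
]

shown, more_results = collapse_by_site(results)
print(shown)
print(more_results)
```

Unlike the behavior noted in the list, this sketch only produces a "more results" group when a host has more than two hits.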
An interesting article on News.com "The Google Gods: Does Search Engine's Power Threaten Web's Independence?" quotes Gary Price of The Virtual Acquisition Shelf and News Desk fame.
AlltheWeb has a new Halloween skin and has announced that it is now fully XHTML and CSS compliant. The Halloween and other skins can be seen and installed in their skins gallery but do require 5.0 and later browsers (i.e. not Netscape 4.7). Beyond design changes, there is little of interest to searchers here, but note that "these new standards provide a future opportunity for AlltheWeb users to perform their searches on platforms such as mobile phones, personal digital assistants (PDAs), among others."