Search Features Category Archive
Earlier this month Google expanded the number of languages available in Google Translate. While the press release and most other coverage talked about ten new languages, the number of language pairs (from language X to language Y) increased far more substantially. Previously, Yahoo! Babel Fish had the most with 38 pairs. Google not only upped the number of possible languages, but every language listed can now be translated to every other. So depending on how you count, Google Translate now has over 500 language pairs available! That's a major increase. As Google Operating System notes, the total varies depending on how you count Chinese. Only one choice is given for input of "Chinese," but Google Translate seems to accept both the Simplified and Traditional versions. Output can specify either Simplified or Traditional. So, if you count both versions of Chinese as one language, this means Google Translate can machine translate 506 language pairs. If you count them as two, it would be 552. And do note that you can input either version of Chinese characters and have it translated to the other.
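The two totals above follow from simple counting: when every language can be translated to every other, n languages give n × (n − 1) ordered pairs. Here is a minimal sketch of that arithmetic. The function name is just for illustration, and the language counts of 23 and 24 are inferred from the 506 and 552 totals rather than taken from Google's announcement.

```python
def ordered_pairs(n_languages):
    """Count directed (source -> target) pairs when every language
    can be translated to every other language."""
    return n_languages * (n_languages - 1)

# Assumption: 23 languages if Chinese counts once,
# 24 if Simplified and Traditional are counted separately.
print(ordered_pairs(23))  # 506
print(ordered_pairs(24))  # 552
```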
Also note that Google has not only expanded its machine translation abilities but has augmented its Translated Search as well. Translated Search (also available on the Language Tools page as "Search Across Languages") will translate the query words and then display results in both the original language and in translation. Google translated search can machine translate query words and pages between the following languages. The following ten languages have been added along with the ability to translate between any of the possible language pairs.
Presumably, Google has been able to make such a major expansion of language translation pairs available by using statistical machine translation developed in-house. This process is described in their FAQ: "we feed the computer billions of words of text, both monolingual text in the target language, and aligned text consisting of examples of human translations between the languages. We then apply statistical learning techniques to build a translation model." Moving to this approach certainly seems to have allowed such a major expansion. Bear in mind that all of this automatic translation is prone to error, although it should give some rough sense of the underlying meaning. I've updated my Online Translation and Translated Search pages with the new languages.
For some time now I have been speaking and writing about ways of speed searching and search switching. Somehow, I've neglected to add the links to my site. So I'm fixing that tonight, before my presentation on Wed. at CIL 2008. The new Search Switching page includes sections for Search Switching Between Web Search Engines, Geographic Search Switching, Book Search Switching, and other options including another link to my Bookmarklets page (with its search transfer bookmarklets).
See also my article, "Speed Searching," in the March 2008 issue of Online (available for fee at ITI's InfoCentral or free from many library databases such as AccessMyLibrary) and my article, "Switching Your Search Engines," from the May 2007 issue of Online (available from many library databases including AccessMyLibrary).
In recent months I have been speaking and writing about some of the language search and translation features of the search engines. Which search engine has the most language limits? Which online translator has the most language pairs? And which ones offer translated search? (Exalead, Yahoo!, and Google, respectively). So I've added a new Language Search Tools page, with links to pages about Language Limits, Online Translation, and Translated Search. I have also finally made a major update to my Search Engines by Search Features page and linked the language page from there.
For many searchers, especially those of us in the middle of a fairly mono-lingual part of the U.S., the language tools may have little appeal, but even in the midst of Montana, I still find times that I come across a non-English site, email, or term that can benefit from the use of these tools.
I've updated my search bookmarklets page due to changes from some of the search engines including Gigablast's new interface, finally changing msn.com to live.com in the code, and the addition of several new links:
- An animated .gif of the search transfer bookmarklets
- Moving the search transfer bookmarklets to the top
- A new bookmarklet for numbering Yahoo! results
- Updating the bookmarklet for numbering Google results
- Gigablast and Exalead search box links
A recent posting at ResourceShelf introduced two new sources of cached Web pages and reminded me to update my list of sources for archived/cached pages. I've added several other sources that I'd run across and not added to that page over the last few months, including Alexa, Healia, and WebCite, along with the ones mentioned by ResourceShelf: DiplomacyMonitor and ZoomInfo. I've moved IncyWincy to the former sources at the bottom, since I can no longer find cached links there. That makes at least 14 sources for finding copies of old pages.
At Search Engine Land, Danny has a long report about Google indexing and ranking issues. While other sections of the post talk about an update to the visible PageRank, issues with supplemental results, and duplicate content, I found the short section on the filetype: command most interesting. Like some of Google's other field search prefix commands, filetype: results in zero records unless it is combined with another search term. So filetype:xls finds nothing, but this is supposed to change sometime in the future and will finally let us run a filetype: search without requiring an additional term. Does this mean that other field searches will be able to be run separately as well? We'll have to wait and see. In the meantime, if you'd like to get all the results Google will give you for some unusual file type, there is an easy way around the additional term requirement.
Whatever was causing the problem with the site search for Canada that Gwen noted, I am happy to report that it has been fixed. I received an email from Sébastien Richard at Exalead reporting that all the top level domains should work with the site: prefix now. At least on my tests of site:in and a few others, it does seem to be working correctly. Kudos to Exalead for making the fix! I've updated my Exalead review.
Yesterday, A9 had a major redesign of its site along with a major accompanying loss of features. The A9 announcement notes that they have "redesigned the A9.com website to make it easier and quicker to discover information from more sources." It has more of a Web 2.0 look and feel, and I think they have achieved a more usable site. The databases are grouped together in the left column and are customizable. Searchers can build their own groups from the more than 400 source databases. Each column (one per database) now features continuous scrolling (like the beta of Live search used to offer and the Live Image search database still does).
From the Official Google Webmaster Central blog comes this post on How search results may differ based on accented characters and interface languages. This highlights a change in the way Google handles diacritics and gives a good overview of how it still varies depending on the search interface language chosen.
Google Reader has changed its default sort to date (in reverse chronological order) according to the Official Google Reader Blog in its Your Wish is Our Command. Google always seems to drag its feet with date sorts. With Web results, date sorting is quite problematic since most Web pages do not have a reliable date. So date sorting of Web results is rarely useful. But with news and other published sources, date sorting is easy and helpful. While Google gives the option for a date sort in Google News, it is not the default. Meanwhile, neither Google Books nor Google Scholar even offers the option. Google Scholar's strange "recent articles" addition a few months ago is not much of a substitute since it just limits to recent years and then does another relevance sort. So if a wish is really a command to Google, here's a wish for a real date sort at Scholar and Books and for a default date sort at Google News and Scholar.
SEOmoz Blog, in its All the Different Ways to Calculate Link Numbers (and the Best One) article, gives an excellent overview of the issues with link searching, especially when looking for a total number of results. It primarily compares Google, Yahoo!, and MSN. Of Google: "With the crappiest numbers around, it's a wonder that anyone pays attention to them at all." MSN also gets hammered: "MSN's numbers, while relatively more accurate than Google, are still largely useless." Yahoo! gets the nod here. While the focus of this link search comparison is the reported number of results, the issues also apply to link searching when looking at the content of the results.
There are problems reported with Google's Wildcard Word in a Phrase. The problem is that the asterisk seems to represent either zero or one word. It used to represent exactly one word. For example, "a little * * * mischief" used to find only "a little neglect may breed mischief" or a similar phrase of six words. Now it also finds pages with just "a little mischief." The cache copy on those pages says that the search terms only appear in pages pointing to the resulting page, but that does not seem accurate. I think that what now happens is that in addition to the way it used to work, Google now also ORs the results of the same search as if the asterisks were not in the query.
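The difference between the two wildcard readings can be sketched with regular expressions. This is only an illustration of "exactly one word" versus "zero or one word" per asterisk, not a description of how Google implements phrase matching. The function name and its flag are hypothetical, and the sketch assumes the phrase starts with a literal word rather than an asterisk.

```python
import re

def phrase_regex(phrase, optional_wildcard):
    """Compile a quoted-phrase query in which '*' stands for a word.

    optional_wildcard=False: each '*' must match exactly one word.
    optional_wildcard=True:  each '*' may match zero or one word.
    Assumes the first token is a literal word, not '*'.
    """
    tokens = phrase.split()
    pieces = [re.escape(tokens[0])]
    for token in tokens[1:]:
        if token == "*":
            pieces.append(r"(?:\s+\w+)?" if optional_wildcard else r"\s+\w+")
        else:
            pieces.append(r"\s+" + re.escape(token))
    return re.compile(r"\b" + "".join(pieces) + r"\b", re.IGNORECASE)

query = "a little * * * mischief"
old_style = phrase_regex(query, optional_wildcard=False)
new_style = phrase_regex(query, optional_wildcard=True)

print(bool(old_style.search("a little neglect may breed mischief")))  # True
print(bool(old_style.search("a little mischief")))                    # False
print(bool(new_style.search("a little mischief")))                    # True
```

Note that the zero-or-one reading would also match four- and five-word phrases, while the OR explanation suggested above would match only the six-word and three-word forms. Testing a few intermediate phrases would help tell the two apart.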
With all the cosmetic changes and bad news this week, I am pleased to see some new and potentially very useful syntax from Google. The number range search lets you search for a range of numbers, say for any number between 5 and 11. It even searches for numbers with and without commas and includes decimals such as 7.23. The number range command consists of a smaller number, two periods, and a larger number, which can be used in conjunction with another search word, as in score 5..11. Adding a dollar sign invokes the price range search, which actually searches for the dollar sign (although it does not yet recognize the pound (£), Yen (¥), or Euro (€) characters), as in good books $5..11. See the new number searching section of my Google review for more details.
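To make the syntax concrete, here is a rough sketch of what a number range match might look like when applied to a piece of text. It is not Google's implementation. The function name and regular expression are made up for illustration, the endpoints are assumed to be inclusive, and a plain range is assumed to accept numbers with or without a dollar sign, while a range with a leading $ requires one.

```python
import re

# A number: optional leading '$', digits with optional commas, optional decimals.
NUMBER = re.compile(r"(?<![\w.])\$?\d[\d,]*(?:\.\d+)?")

def numbers_in_range(text, query):
    """Return the numbers in text that fall inside a query like '5..11'
    or '$5..11' (assumed inclusive at both ends)."""
    price_only = query.startswith("$")
    low_text, high_text = query.lstrip("$").split("..")
    low, high = float(low_text), float(high_text)
    hits = []
    for match in NUMBER.finditer(text):
        token = match.group()
        if price_only and not token.startswith("$"):
            continue  # a price range only counts numbers written with '$'
        value = float(token.lstrip("$").replace(",", ""))
        if low <= value <= high:
            hits.append(value)
    return hits

sample = "Final score 7.23 after 1,500 plays; books cost $9 and $12, or 10 used."
print(numbers_in_range(sample, "5..11"))   # [7.23, 9.0, 10.0]
print(numbers_in_range(sample, "$5..11"))  # [9.0]
```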
Although I've had a review of Yahoo! as a directory for several years, now that Yahoo! has launched its own search engine, I've made a first attempt at a review of its search features. Since it is fairly new, I expect to see the features change over the next few months, but at least I have something up that seems accurate as of today. A few notes about the current version of Yahoo! Search and items highlighted in the review:
- The Yahoo! database appears to be primarily an Inktomi-like database, but there are significant differences from other Inktomi-based search engines like MSN Search and HotBot.
- Both cached copies of pages and HTML versions of PDF and other file types are available
- Only the first 500 KB of a document are indexed, which is better than Google's 101 KB but still short of the full document indexing that has been available at AlltheWeb
- Full Boolean searching using AND, OR, NOT, and parentheses for nesting seems to work
- Field searches are available with intitle:, inurl:, site:, link:, hostname:, and url:
- The new search engine database is available on the main Yahoo! site and directly at search.yahoo.com.
Sometime between June and October, Gigablast took away the option to sort by date on the advanced search form. Since Gigablast was the only search engine to offer that option, it is a shame to see it disappear. At least Gigablast still reports the date indexed and the last modified date stamp as of the last crawl. Also, the Add URL page is "temporarily disabled."
The full Boolean capabilities of Gigablast announced on Monday don't always seem to work right. The - is working more accurately than either NOT or AND NOT today. I am hoping it is a momentary glitch since I just updated the Gigablast review, the search feature chart, and the search engines by search feature page.
Back in May, Google's intitle: and inurl: were not working properly, as I posted earlier. Well, they now seem to be working again. A search that combines a general query term with these field searches, like "market research" intitle:tourism, now works. I've updated my Google Inconsistencies page to note that the problem has been fixed, but I added another report of a strange result for the simple query of 'cameras.'
Beware of diacritics in search terms. The various search engines handle them in different ways. Take a diacritic like e with an acute accent, 'é.' Will an 'e' match on both the 'e' as well as the e acute? At Google, searches will only be an exact match. At AltaVista and other search engines, the plain 'e' will match on both (and other 'e' diacritics). For example, a Google search for "epistolaires de mari" found only one hit while "épistolaires de mari" found more than a dozen. At AltaVista and other search engines, "epistolaires de mari" finds all the diacritic permutations. I hope to do a more in-depth diacritic showdown in the future, but for now, the lesson is to search both with and without the diacritics at Google for the most comprehensive search.
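The two behaviors can be illustrated with a small sketch that compares terms either exactly or after folding away the accents. This is one common way to do diacritic-insensitive matching (Unicode decomposition followed by dropping the combining marks). It is an assumption for illustration, not a claim about how AltaVista or Google actually index text, and the function names are hypothetical.

```python
import unicodedata

def fold_diacritics(text):
    """Strip accents by decomposing characters (NFD) and dropping
    the combining marks, so that 'é' becomes 'e'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def term_matches(query, indexed_term, exact):
    """exact=True compares the strings as typed (the Google-like behavior
    described above); exact=False folds both sides before comparing."""
    if exact:
        return query == indexed_term
    return fold_diacritics(query) == fold_diacritics(indexed_term)

print(fold_diacritics("épistolaires"))                           # epistolaires
print(term_matches("epistolaires", "épistolaires", exact=True))  # False
print(term_matches("epistolaires", "épistolaires", exact=False)) # True
```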
Today, AlltheWeb announced their new advanced search features primarily available on their advanced search page. Most of the announcement is for features that have been available since Sept. 12. Their new KWIC display that has been available since Sept. 24 and that they call "visual relevancy" is also mentioned.
Even though these features have been available for a few weeks, it is a refreshing change to see a search engine actually announcing and publicizing advanced search features. Too often in the past there has been no publicity or even acknowledgment of new features.
Front or beginning truncation that was available at HotBot, NBCi, Anzwers, and iWon no longer works. As far as I can tell, none of the major search engines supports beginning truncation any longer (and it was never a publicly announced feature). The search engines by feature page and the reviews for HotBot and NBCi were updated.
At last, two more major search engines that cluster results by site now offer functional ways to uncluster the results. HotBot's filter, announced below on July 6, now seems to work. Google has been showing only the first two pages per site with no option for unclustering all of the results. Now, for searches that find fewer than 1,000 hits, there is a brief message at the bottom of the last page of hits:
"In order to show you the most relevant results, we have omitted some entries very similar to the ones already displayed. If you would like, you can repeat the search with the omitted results included."
Try this search on tilinca with hits set to 100 and then click the link at the bottom to see the unclustered results. By the way, Google has changed the language on its display back from "Show matches (Cache)" to the original "Cached." Sometimes common sense prevails.