May 2008 Archive
Sometimes I find the Google blog posts to be long-winded, high on hype, and low on information value. Yesterday's post about Google Search Quality started out in a similar vein, but it quickly improved and contains a number of interesting points about how Google handles searches and ranking. And for all those who like to say, "Just make it more like Google" and expect that to be a simple fix, please note how Google describes the effort behind its search quality: "more than one thousand programmer/scientist years have gone directly into their development."
Several extracts that I found of interest include:
- Ranking algorithms include many aspects beyond PageRank:
  - language models (the ability to handle phrases, synonyms, diacritics, spelling mistakes)
  - query models (how people use language today)
  - time models (some queries are best answered with a 30-minutes old page, and some are better answered with a page that stood the test of time)
  - personalized models (not all people want the same thing)
- Evaluation includes automated evaluations every minute (to make sure nothing goes wrong)
- Change Frequency: "In 2007, we launched more than 450 new improvements"
While these points may not have any direct bearing on how we can better use Google, they do help explain why results and processing change from one day to the next.
Earlier this month Google expanded the number of languages available in Google Translate. While the press release and most other coverage talked about ten new languages, the number of language pairs (from language X to language Y) increased far more substantially. Previously, Yahoo! Babel Fish had the most, with 38 pairs. Google not only increased the number of languages, but every language listed can now be translated into every other. So depending on how you count, Google Translate now has over 500 language pairs available! That's a major increase.
As Google Operating System notes, the total depends on how you count Chinese. Only one choice, "Chinese," is given for input, but Google Translate seems to accept either the Simplified or the Traditional version. Output can be set to either Simplified or Traditional. So, if you count both versions of Chinese as one language, Google Translate can machine translate 506 language pairs. If you count them as two, the total is 552. And do note that you can input either version of Chinese characters and have it translated to the other.
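The two totals can be checked with a little arithmetic. If every one of n languages can be translated into every other, there are n × (n − 1) directed pairs. The sketch below assumes language counts of 23 and 24. Those counts are not stated in this post; they are simply the values that reproduce the quoted totals, and the function name is my own.

```python
def directed_language_pairs(language_count: int) -> int:
    """Number of (from, to) pairs when every language translates to every other."""
    # Each language can be a source, and each of the remaining ones a target.
    return language_count * (language_count - 1)


# Assumed counts (inferred from the totals, not given in the post):
# 23 if Simplified and Traditional Chinese count as one language,
# 24 if they count as two.
chinese_as_one = directed_language_pairs(23)
chinese_as_two = directed_language_pairs(24)

print(chinese_as_one, chinese_as_two)
```

Counting this way treats English to French and French to English as two separate pairs, which appears to be how the figures above were reached.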
Also note that Google has not only expanded its machine translation abilities but has augmented its Translated Search as well. Translated Search (also available on the Language Tools page as "Search Across Languages") will translate the query words and then display results in both the original language and in translation. Google Translated Search can machine translate query words and pages between the following languages. The following ten languages have been added, along with the ability to translate between any of the possible language pairs.
Presumably, Google has been able to make such a major expansion of language translation pairs available by using statistical machine translation developed in-house. This process is described in their FAQ: "we feed the computer billions of words of text, both monolingual text in the target language, and aligned text consisting of examples of human translations between the languages. We then apply statistical learning techniques to build a translation model." Moving to this approach certainly seems to have allowed such a major expansion. Bear in mind that all of this automatic translation is prone to error, although it should give some rough sense of the underlying meaning. I've updated my Online Translation and Translated Search pages with the new languages.
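To give a rough feel for what "statistical learning" on aligned text can mean, here is a toy sketch of IBM Model 1, a classic textbook word-translation model. This is an assumption chosen for illustration: the FAQ does not say which models Google uses, and a real system is vastly larger and also draws on the monolingual text mentioned in the quote. The three-sentence corpus and all function names are made up.

```python
from collections import defaultdict


def train_model1(sentence_pairs, iterations=10):
    """Estimate P(target_word | source_word) from aligned sentences (IBM Model 1)."""
    target_vocab = {w for _, tgt in sentence_pairs for w in tgt}
    uniform = 1.0 / len(target_vocab)
    # Start with no preference: every target word is equally likely.
    prob = defaultdict(lambda: uniform)

    for _ in range(iterations):
        count = defaultdict(float)   # expected co-translation counts
        total = defaultdict(float)   # expected counts per source word
        for src, tgt in sentence_pairs:
            for t in tgt:
                # Share the credit for t among the source words in this sentence.
                norm = sum(prob[(t, s)] for s in src)
                for s in src:
                    share = prob[(t, s)] / norm
                    count[(t, s)] += share
                    total[s] += share
        # Re-estimate the probabilities from the expected counts.
        for (t, s), c in count.items():
            prob[(t, s)] = c / total[s]
    return prob


def best_translation(prob, source_word):
    """Return the target word with the highest learned probability."""
    candidates = {t: p for (t, s), p in prob.items() if s == source_word}
    return max(candidates, key=candidates.get)


# Hypothetical aligned corpus: (English tokens, German tokens).
corpus = [
    ("the house".split(), "das haus".split()),
    ("the book".split(), "das buch".split()),
    ("a book".split(), "ein buch".split()),
]
model = train_model1(corpus)
for word in ["the", "house", "book", "a"]:
    print(word, "->", best_translation(model, word))
```

No dictionary is supplied anywhere: the word pairings emerge only from which words keep appearing together in aligned sentences. That is also why this kind of approach scales to many language pairs once enough aligned text is available, and why its output can still be wrong.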