The Price is Wrong, Bob

Few technologies could change the world for the better more than a device that could translate passages of text into any other language. This is a problem that computer scientists have struggled with for decades. The earliest systems just used dictionaries and simple rule sets. So, knowing that the Spanish word for ‘house’ is ‘casa’ and the word for ‘white’ is ‘blanco’, and also that Spanish puts adjectives after nouns instead of before them and that the adjective ending must change to fit the noun, a rules-based system would be able to translate the phrase ‘the white house’ as ‘la casa blanca’.
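
To make that concrete, here is a minimal sketch of such a rules-based system in Python. It is a toy under stated assumptions, not how any real system was built: the tiny `LEXICON`, the `agree` helper and the `translate_noun_phrase` function are names invented for this illustration, and it only handles phrases of the form ‘the <adjective> <noun>’.

```python
# Toy English-to-Spanish rule-based translator (illustrative only).
# LEXICON, agree and translate_noun_phrase are hypothetical names.

# English word -> (Spanish base form, grammatical gender for nouns, else None)
LEXICON = {
    "house": ("casa", "f"),
    "book": ("libro", "m"),
    "white": ("blanco", None),
    "red": ("rojo", None),
}
ARTICLES = {"m": "el", "f": "la"}


def agree(adjective, gender):
    """Make an adjective ending in -o agree with a feminine noun."""
    if gender == "f" and adjective.endswith("o"):
        return adjective[:-1] + "a"
    return adjective


def translate_noun_phrase(phrase):
    """Translate 'the <adjective> <noun>' using two hand-written rules."""
    article, adjective, noun = phrase.lower().split()
    if article != "the":
        raise ValueError("this sketch only handles phrases starting with 'the'")
    noun_es, gender = LEXICON[noun]
    adjective_es, _ = LEXICON[adjective]
    # Rule 1: the adjective follows the noun.
    # Rule 2: the adjective ending changes to fit the noun's gender.
    return f"{ARTICLES[gender]} {noun_es} {agree(adjective_es, gender)}"


print(translate_noun_phrase("the white house"))  # la casa blanca
print(translate_noun_phrase("the red book"))     # el libro rojo
```

Even this toy shows where the trouble starts: every new word, every irregular ending and every new language pair means more hand-written entries and rules.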

The problems with this approach are legion. Obviously, the rule sets must be custom-written for each language pair (in each direction) and therefore quickly become very cumbersome. And history has shown that such systems tend to produce text that is comprehensible, but not natural-sounding. Part of the problem is that there is not a one-to-one translation of words from one language to another, and the best translation strongly depends on context. So the English words bear, bat, fly, fish, rat, duck, goose, cow, hawk, and badger might have one best translation when referring to animals, but another when they are used as verbs (to bear, to bat, to fly, etc.).

Also, in some cases, the best translation of a word is to simply leave it alone. If we translate the sentence from above, “The Spanish word for ‘house’ is ‘casa’.”, into Spanish, a rules-based translation might give: “La palabra española para ‘casa’ es ‘casa’.” But the first quoted word is there as an example of an English word, so in this specific case it should be left in English.

To know which version of a translation is best to use when, Google (and much of the natural language processing community) has turned to a technique called ‘statistical machine translation’. Rather than try to write enormous sets of rules, Google has surfed billions upon billions of webpages, some of which (like the government records of officially bi- or multi-lingual countries like Switzerland and Canada) contained passages of translated text. Using statistics, they then process the text they find to guess the best translations.

For example, Google’s Translate service says that the English word ‘black’ in German is ‘schwarz’. The phrase ‘black cat’ is ‘schwarze Katze’, because ‘schwarz’ means ‘black’ and ‘Katze’ means ‘cat’. But when I type in ‘black eye’, Google Translate gives ‘blaues Auge’. ‘Auge’ obviously means ‘eye’, but what happened to the word ‘schwarz’? The word ‘blaues’ means ‘blue’, not black. So, what’s gone wrong here?

Well, it turns out, nothing has gone wrong. Google Translate has gotten this exactly right. The expression ‘a black eye’ is an idiom. In English we might say ‘I gave him a black eye’, but the German equivalent literally means ‘I gave him a blue eye.’ So when talking about the color black in general, ‘schwarz’ is the best translation, but when talking specifically about a black eye, ‘blaues Auge’ is best. Google knows this is true because they have seen many, many instances of the word ‘black’ being translated as ‘schwarz’ and many instances of ‘black eye’ being translated as ‘blaues Auge’, but very few cases (if any) of ‘black’ being translated as ‘blau’ or ‘black eye’ as ‘schwarzes Auge’.
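
The counting idea can be sketched in a few lines of Python. This is only an illustration of the principle, not Google’s actual system: the `PHRASE_COUNTS` table and every number in it are invented for the example, and the matching is a simple ‘prefer the longest known phrase’ rule.

```python
# Hypothetical counts of how often an English phrase was seen translated
# as a given German phrase. All numbers are made up for illustration.
PHRASE_COUNTS = {
    "black": {"schwarz": 9500, "blau": 3},
    "eye": {"Auge": 8000},
    "cat": {"Katze": 7000},
    "black cat": {"schwarze Katze": 300},
    "black eye": {"blaues Auge": 420, "schwarzes Auge": 2},
}


def best_translation(phrase):
    """Pick the translation that was observed most often for this phrase."""
    options = PHRASE_COUNTS[phrase]
    return max(options, key=options.get)


def translate(text, max_len=2):
    """Translate left to right, preferring the longest phrase in the table."""
    words = text.lower().split()
    output = []
    i = 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in PHRASE_COUNTS:
                output.append(best_translation(candidate))
                i += n
                break
        else:
            # No known phrase starts here: leave the word untranslated.
            output.append(words[i])
            i += 1
    return " ".join(output)


print(translate("black"))      # schwarz
print(translate("black cat"))  # schwarze Katze
print(translate("black eye"))  # blaues Auge
```

Because ‘black eye’ has its own entry, the idiom wins over the word-by-word reading, which is exactly the behavior described above.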

The result of all of this web-surfing and data processing is that Google Translate can not only handle short snippets of text between many language pairs, but can also translate entire webpages in just a couple of seconds. You can read a page and then click on one of the links to get that page translated for you, and so forth. So, you can surf foreign webpages in your language, maybe even order products or services, without ever seeing the original pages. The service is not perfect: for example, if there is a picture of text (like in a banner image or certain kinds of buttons), the original text will remain. But, in general, it is very good.

Except for this little problem I found: translating prices, dates, times, names and perhaps other critical information. Below is a screenshot from Google’s translated page for Qualité Search Marketing’s Academy page. QSM is a Norwegian search engine marketing firm that, among other things, runs a series of short courses (their ‘Academy’) that you can attend in the Oslo area to learn more about optimizing landing pages, writing better search ads and so forth. All of these courses are run in Norway and are priced in Norwegian kroner (‘kr’). But look at Google Translate’s version of the page.


Google has done a good job with the number itself. The original text follows the European form (‘4.500,00’, which means 4 thousands, 5 hundreds and 0 partial units of currency). The translated number reads ‘4500.00’, which follows the American form of writing numbers.
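
The number conversion itself is simple enough to sketch. The Python below is a minimal illustration, assuming the input always uses a period for thousands and a comma for decimals; `parse_european_number` and `format_american` are names made up for this example.

```python
from decimal import Decimal


def parse_european_number(text):
    """Read a number written with '.' for thousands and ',' for decimals."""
    normalized = text.replace(".", "").replace(",", ".")
    return Decimal(normalized)


def format_american(value, grouping=False):
    """Write a number with '.' for decimals and optional ',' for thousands."""
    return f"{value:,.2f}" if grouping else f"{value:.2f}"


price = parse_european_number("4.500,00")
print(format_american(price))                 # 4500.00
print(format_american(price, grouping=True))  # 4,500.00
```

Note that the conversion only works if you already know which convention the source uses: read the American way, ‘4.500’ would be four and a half.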

But for the unit of currency, Google is acting funny. You can see that the original text for one entry (‘Tactical search engine marketing’) says that the 6-hour course costs 4,500 kroner (about $765 US, or roughly the price of a beer at any bar in that country). But Google Translate says that the course costs 4,500 British pounds, or about $7200 US.

For some strange reason, the price of the last course offered on the page has been translated to say 4,500 New Zealand dollars (about $3400 US), even though its original text is exactly the same as the ones that Google turned into 4,500 British pounds.
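
To see how much the currency label matters, here is a quick bit of arithmetic in Python. It takes the rough dollar figures quoted above as given (they are my approximations at the time of writing, not current exchange rates) and works out the implied rate for each currency; `implied_rate` is just a helper name for this sketch.

```python
AMOUNT = 4500  # the number shown on the page, whatever the currency

# Approximate US dollar values quoted in the text for 4,500 units of each.
USD_ESTIMATES = {"NOK": 765, "GBP": 7200, "NZD": 3400}


def implied_rate(usd_value, amount=AMOUNT):
    """US dollars per one unit of the currency, implied by the estimate."""
    return usd_value / amount


for currency, usd in USD_ESTIMATES.items():
    print(f"1 {currency} is roughly {implied_rate(usd):.2f} USD")

# How badly each wrong label inflates the real (kroner) price.
for wrong in ("GBP", "NZD"):
    factor = USD_ESTIMATES[wrong] / USD_ESTIMATES["NOK"]
    print(f"Labelled as {wrong}, the course looks {factor:.1f} times too expensive")
```

With these figures, the pound label makes the course look about 9.4 times more expensive than it really is, and the New Zealand dollar label about 4.4 times.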

All of the dates that appear on this translated page are listed in their original form (for example, ‘25.11.10’, which means the 25th day of the 11th month of the 10th year, or November 25, 2010). Google has left the date in its original form, presumably because it doesn’t recognize that ‘25.11.10’ refers to a date in this particular case.

Over at QSM’s blog, though, things get a lot goofier. Check out this screenshot:


Google says that one entry about Google Places was posted by ‘Tøien Agnete Pedersen’ on ‘09.17.1910’. Here, Google has recognized that the original ‘17.09.10’ refers to a date, and the day and month have been switched properly from Norwegian format to American format, but Google’s interpretation of ‘10’ to mean 1910, not 2010, is strange. (Also puzzling is the fact that Google says the author’s name is ‘Tøien Agnete Pedersen’, when the original text says ‘Agnete Tøien Pedersen’. Oddly, on Google’s translated version of QSM’s Employees page, they turn ‘Agnete Tøien Pedersen’ into ‘Agnete Pedersen Tøien’.)
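
For comparison, here is a minimal sketch of the date conversion one would expect, using Python’s standard `datetime` module. The function name `norwegian_to_american` is made up for this example, and the century choice is simply Python’s own: its `%y` directive maps two-digit years 00 to 68 onto 2000 to 2068 and 69 to 99 onto 1969 to 1999.

```python
from datetime import datetime


def norwegian_to_american(date_text):
    """Convert 'dd.mm.yy' (Norwegian style) to 'mm/dd/yyyy' (American style)."""
    parsed = datetime.strptime(date_text, "%d.%m.%y")
    return parsed.strftime("%m/%d/%Y")


print(norwegian_to_american("17.09.10"))  # 09/17/2010
print(norwegian_to_american("25.11.10"))  # 11/25/2010
```

Any two-digit year needs some rule like this to pick a century, but for a blog post it is hard to imagine a sensible rule that lands on 1910.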

In the post by Magne Uppman, even though Translate has recognized that the original text (‘13.09.10’) is a date, Google has not changed the format to American style (which would be ‘09.13’ rather than ‘13.09’). Google managed to switch the format properly in the previous post, but not this one for some reason. To make things even stranger, in some later posts on the translated page of Qualité Search Marketing’s blog, Google has replaced the periods with slashes so that the dates read like ‘06/02/1910’ (the 2nd day of June, 1910), and in other places the periods are replaced by slashes but the order of month and day is not switched from the original format (so we get things like ‘26/05/1910’, or the 26th day of May).
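
Part of what makes mixed formats so dangerous is that many dates are valid either way round. The short Python sketch below (with `possible_readings` as an invented helper name) tries a date both day-first and month-first and reports every reading that is a real calendar date.

```python
from datetime import datetime


def possible_readings(date_text, sep="."):
    """Return (label, ISO date) for each ordering that gives a valid date."""
    formats = (
        ("day-first", f"%d{sep}%m{sep}%y"),
        ("month-first", f"%m{sep}%d{sep}%y"),
    )
    readings = []
    for label, fmt in formats:
        try:
            parsed = datetime.strptime(date_text, fmt)
        except ValueError:
            continue  # e.g. there is no month 13 or 26
        readings.append((label, parsed.date().isoformat()))
    return readings


print(possible_readings("13.09.10"))
# [('day-first', '2010-09-13')]
print(possible_readings("02.06.10"))
# [('day-first', '2010-06-02'), ('month-first', '2010-02-06')]
```

A date like ‘13.09.10’ can only be read one way, but ‘02.06.10’ could be June 2nd or February 6th, so a reader of a page that mixes both orderings has no way to tell which one is meant.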

In the post above about Yahoo! Desktop Search Marketing, Google shows that the time it was published was ‘24:57’ on October 14th (which would actually be 57 minutes into October 15th). The original text says it was actually published at ‘12:57’ on October 14th. For a post (not shown in the screenshot above) that the original text says was published at ‘13:41’ (following Norway’s use of ‘military time’ for times after noon), Google has simply left it as ‘13:41’ in the translated text. However, in another post which was published at ‘22:06’, Google has turned that into ‘10:06 pm’, so that the translated text uses both formats for times between noon and midnight.
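
A consistent conversion from the 24-hour clock to the American 12-hour clock is a small job. Here is a minimal Python sketch, with `to_12_hour` as a name made up for this example; it leans on the standard library to reject impossible times.

```python
from datetime import datetime


def to_12_hour(time_text):
    """Convert a 24-hour 'HH:MM' time into 12-hour form with am/pm."""
    parsed = datetime.strptime(time_text, "%H:%M")  # rejects hours above 23
    hour = parsed.hour % 12 or 12  # 0 and 12 both display as 12
    suffix = "am" if parsed.hour < 12 else "pm"
    return f"{hour}:{parsed.minute:02d} {suffix}"


for original in ("12:57", "13:41", "22:06"):
    print(original, "->", to_12_hour(original))
# 12:57 -> 12:57 pm
# 13:41 -> 1:41 pm
# 22:06 -> 10:06 pm

try:
    to_12_hour("24:57")
except ValueError:
    print("24:57 is not a valid time")
```

The one tricky case is the hour after noon: 12:57 stays ‘12:57’ and only gains a ‘pm’. Adding twelve to it, as seems to have happened here, produces a time that does not exist.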

Some later posts list Magne Uppman’s name as ‘Magnus Uppman’ and one post whose title in the original text was ‘Google with Twitter Ads’ has been “translated” from English to English to become ‘Twitter with Google Ads’.


(Interestingly, the Norwegian text on the page gets translated fairly well, but the ‘Google with Twitter Ads’ post, which is actually in English, looks like it was run through a meat grinder. This means that Google can translate Norwegian to English better than it can recognize that English text is English and just leave it alone.)

What’s really weird to me is that all these multiple different display formats are being used on the same page. Four dates, four different ways of showing them. One name, several different ways of showing it. I can understand why translations would go wrong sometimes, but am having more trouble understanding how a single date format in the original text could be rendered in multiple different ways in the translated text. It makes me wonder if Google is intentionally showing multiple formats and styles of time, date, place and name translations for the explicit purpose of getting website owners to suggest better translations of the information shown on their own site. What person, when shown a mistranslation of their own name, or what company whose price list got mangled, wouldn’t take a few minutes to set Google Translate straight?

Could it be that Google is simply attempting to ‘crowdsource’ the translation of the toughest parts of context-sensitive text? Even if Google is not intentionally fiddling with their translations in order to elicit corrective feedback, I guess my question is: shouldn’t they?





