Is Google’s OCR getting worse in some ways?

Google uses Optical Character Recognition (OCR) to turn scanned pages from millions of printed books into digital text.

But OCR is not always accurate. For example, before about the year 1800 it was common to write the letter ‘s’ in an elongated form (known as the ‘long s’) that looks a lot like the letter ‘f’.

So, in the image below from John Norris’ 1710 book ‘A Collection of Miscellanies’ you can see the passage: “The first Thing that I observe, is, that ’tis generally agreed upon among them, That this Fruition of God consists in some Operation; and I think with very good Reason.”

But if you look closely you’ll see that the ‘s’ in words like ‘first’ and ‘observe’ looks a lot like an ‘f’, so that the words look like ‘firft’ and ‘obferve’. Just in this short passage we can also see ‘confists’, ‘fome’, ‘Reafon’, ‘Happinefs’, ‘underftand’, ‘beft’, ‘laft’, ‘fo’ and ‘underftood’.

Obviously, it can also be difficult for a computer program to understand that these are s’s, not f’s.

In 2009, when Google released summary statistics about how often different words occur in the books it had scanned, there were many instances of words like ‘beft’ and ‘firft’, especially in text written before 1800. However, Google updated its analysis in 2012, reclassifying many of those occurrences of ‘beft’ as ‘best’.
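To compare how often a word and its misread form show up, it helps to generate the ‘f’ spelling automatically. Here is a minimal sketch in Python. The function name `long_s_misreading` is my own, and the rule it applies (every lowercase ‘s’ except a word-final one was printed as a long s) is a simplifying assumption; real printers followed more detailed conventions.

```python
def long_s_misreading(word: str) -> str:
    """Return the spelling OCR might produce if it read each long s as 'f'.

    Simplifying assumption: every lowercase 's' except the last letter
    of the word was printed as a long s. Capital 'S' is left alone.
    """
    if not word:
        return word
    body, last = word[:-1], word[-1]
    # Swap 's' for 'f' everywhere except in the final position.
    return body.replace("s", "f") + last


for word in ["best", "first", "observe", "Reason", "Happiness"]:
    print(word, "->", long_s_misreading(word))
```

Each pair (for example ‘best’ and ‘beft’) can then be looked up in both the 2009 and 2012 statistics.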



But, for some words, the newer version of Google’s statistics suggests that the character recognition has gotten worse at identifying the correct word. For example, the capital letter ‘O’ looks a lot like the number ‘0’, so chemical formulas for compounds that contain oxygen, like ‘H2O’ (h-two-oh), can easily be misread as ‘H20’ (h-twenty).

For cases like ‘H2S04’ and ‘Fe203’ (which contain zeroes instead of the letter ‘O’), it is highly likely that these strings are simply misreadings of ‘H2SO4’ (sulfuric acid) and ‘Fe2O3’ (iron oxide).
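The same trick works for the formulas: swap every capital ‘O’ for a zero to get the likely misreading, then look up both forms. This is only a sketch; the function name `zero_for_o` and the short formula list are my own choices for illustration, and the blanket replacement would also mangle element symbols that begin with ‘O’ (such as ‘Os’ for osmium), so it only suits formulas where ‘O’ means oxygen.

```python
def zero_for_o(formula: str) -> str:
    """Return the formula as OCR might misread it, with each 'O' as '0'."""
    return formula.replace("O", "0")


# Illustrative formulas taken from the examples in this post.
formulas = ["H2O", "H2SO4", "Fe2O3"]

# Map each correct formula to its zero-for-O misreading.
misreadings = {formula: zero_for_o(formula) for formula in formulas}

for correct, misread in misreadings.items():
    print(correct, "->", misread)
```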


I have been able to find many more examples like this using the formulas of chemical compounds that contain oxygen. In each of these cases, Google’s character recognition seems better in the 2009 statistics than in the 2012 version.


