ETAION SRHLDC

 

A couple of my favorite blogs, TYWKIWDBI and Andrew Gelman, have covered letter frequency in the English language recently, particularly in relation to the game Hangman.

The sequence I have seen most often puts E as the most frequently used letter, followed by T, A, and so on. The 12 most commonly used letters are supposedly:

ETAOIN SHRDLU

However, letter frequency obviously depends on the body of text (or ‘corpus’) that you analyze, so the sequence for the remaining letters of the alphabet depends on whether you look at books, newspapers, webpages, etc.

Wikipedia offers a list based on the frequency of letters in books that were digitized as part of Project Gutenberg, a set of a few tens of thousands of classic works that were typed by hand in the days before high-speed optical character recognition (OCR), like what Google has used to digitize millions of books.

From Project Gutenberg, the sequence of the 12 most common letters in English is:

ETAOIN SHRDLC

But Google offers statistics about the millions of books it has digitized, including a list of every word that has appeared in print and the number of times it appeared in each year, for books published in the US and for books published in the UK. (Granted, the list for US-published books alone takes up about 7 GB of hard disk space.)

It is straightforward to run through the list, adding up the number of times each letter appeared each year, regardless of whether it was a capital or lower-case letter. So, if the word ‘data’ appeared in US books 496,559 times in 1982 (as Google’s stats claim), then we record that the letters ‘D’ and ‘T’ were each used 496,559 times and the letter ‘A’ was used 993,118 times, for that word alone. Run through every word and add up all the times that every letter was used. What I found for books published in the years 1800-2000 (inclusive) was the following:

ETAOIN SRHLDC

This is very similar to the Project Gutenberg sequence, just with the positions of some of the last six letters switched. This was true for both US- and UK-published books.
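The counting procedure described above can be sketched in a few lines of Python. This is only a minimal illustration, not the code actually used for these results: the function names `letter_counts` and `top_sequence` are hypothetical, and the input is assumed to be already reduced to (word, count) pairs. The only real figure in it is the ‘data’ example from 1982 quoted above.

```python
from collections import Counter

def letter_counts(word_counts):
    """Add up letter usage, weighting each letter by the word's count.

    word_counts: iterable of (word, count) pairs (an assumed input shape).
    Case is ignored and anything outside A-Z is skipped.
    """
    totals = Counter()
    for word, count in word_counts:
        for ch in word.upper():
            if "A" <= ch <= "Z":
                totals[ch] += count
    return totals

def top_sequence(totals, n=12):
    """Return the n most-used letters, most frequent first.

    Ties are broken alphabetically (an arbitrary choice for this sketch).
    """
    ranked = sorted(totals, key=lambda ch: (-totals[ch], ch))
    return "".join(ranked[:n])

# The single-word example from the text: 'data' in US books, 1982.
counts = letter_counts([("data", 496559)])
print(counts["D"], counts["T"], counts["A"])  # 496559 496559 993118
```

Running the same two functions over every word in the list, rather than a single word, gives the overall 12-letter sequence.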

But when I broke down the numbers by year, things got more interesting. This sequence (ETAOIN SRHLDC) is true for the year 1900 and actually all of the years from 1835-1963 for both US- and UK-published works.

However, from 1800-1834, the sequence of the first 6 letters was ETAOIN in some years (including 1800, 1802, 1804 and 1805) and ETOAIN in other years (including 1801, 1803 and 1806).

In the year 1964 and after, the sequence of the 6 most-common letters became ETAION (with the exceptions of 1968, 1969 and 1970, when it was still ETAOIN). This is true for both US- and UK-published books.

So, there seem to have been two shifts in letter usage in the English language. The first occurred in the years leading up to 1835, when the sequence moved back and forth between ETAOIN and ETOAIN before settling on ETAOIN. The second occurred in 1964, when the sequence shifted to ETAION and remained that way until at least the year 2000 (with the exception of 1968-1970).
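The year-by-year breakdown can be sketched the same way. This is a rough illustration under stated assumptions, not the code behind the results above: it assumes each input line is tab-separated with the word, the year and the number of occurrences as its first three fields, and the function name `yearly_sequences` is hypothetical. The sample lines are made-up toy data, not Google’s figures.

```python
from collections import Counter, defaultdict

def yearly_sequences(lines, n=6):
    """Map each year to its n most-used letters, most frequent first.

    Assumes tab-separated lines whose first three fields are
    word, year, count. Shorter lines are skipped.
    """
    per_year = defaultdict(Counter)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        word, year, count = fields[0], int(fields[1]), int(fields[2])
        for ch in word.upper():
            if "A" <= ch <= "Z":  # ignore digits, punctuation, accents
                per_year[year][ch] += count
    sequences = {}
    for year in sorted(per_year):
        totals = per_year[year]
        # Ties are broken alphabetically (an arbitrary choice).
        ranked = sorted(totals, key=lambda ch: (-totals[ch], ch))
        sequences[year] = "".join(ranked[:n])
    return sequences

# Made-up toy input, only to show the shape of the result.
toy_lines = ["tea\t1900\t3", "at\t1900\t2", "to\t1901\t5", "toe\t1901\t1"]
sequences = yearly_sequences(toy_lines)
print(sequences)
```

Comparing the sequences of consecutive years then shows where the order of the top six letters changes.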

My suspicion is that Project Gutenberg’s sequence comes from the fact that many of the books they have digitized were published before 1964.

In both shifts, ‘O’ became less popular. I am now looking into what might have caused these shifts. Google’s Ngram Viewer does not seem to support the idea that changes in the use of the personal pronouns ‘I’ and ‘you’ were responsible. Perhaps the gradual shift in some verbs (such as from ‘smote’ to ‘smited’ and ‘strove’ to ‘strived’) is what is causing ‘O’ to become less common over time.


  1. jaoriwe

    This is amazing 🙂
    Thanks for the informative work!



