Google Ngrams mysteries: word length and decimal frequency in US and UK texts

I made the graph below from Google Ngrams data. From the Google Ngrams datasets, I downloaded a list of every word that appeared in each year across the millions of US and UK books that Google scanned. These records also state how many times a given word appeared in each year. For example, Google says that it found the word ‘analytical’ 20,096 times in books that were published in the US in 1985.
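For anyone who wants to follow along, a record like the ‘analytical’ one could be read with a few lines of Python. This is only a sketch under an assumption: that each record is a tab-separated line whose first three fields are the word, the year, and the number of occurrences. The names `Record` and `parse_record` are made up for illustration, so check the layout of the files you actually download.

```python
from typing import NamedTuple


class Record(NamedTuple):
    word: str
    year: int
    count: int


def parse_record(line: str) -> Record:
    """Parse one record, assuming tab-separated word, year, count.

    Any fields after the third are ignored.
    """
    fields = line.rstrip("\n").split("\t")
    return Record(fields[0], int(fields[1]), int(fields[2]))


# The 'analytical' example from the text, written in the assumed layout.
record = parse_record("analytical\t1985\t20096")
```

Keeping the parsing in one small function means that if the real files turn out to have a different column order, only this function needs to change.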

I then sorted through these lists to keep only lower-case single words (to filter out proper nouns) made from the 26 letters of the English alphabet. This threw out numbers and words that contain numbers (like ‘MP3’). I also only considered words that were 3 or more characters long (since Google considers punctuation to be ‘words’). Finally, Google says that its statistics for books published after the year 2000 might not be reliable, so I only looked at the time frame from 1860 to 2000.
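The filtering rules in the paragraph above can be sketched as a single Python check. This is not the code that produced the graph, just one way to write the same rules. The name `keep` is hypothetical, and treating both 1860 and 2000 as included is an assumption.

```python
import re

# Three or more lower-case letters a-z and nothing else.
# This rejects capitalized words, digits, punctuation, and accented letters.
WORD_PATTERN = re.compile(r"[a-z]{3,}")

# Assumption: both end years are part of the time frame.
FIRST_YEAR, LAST_YEAR = 1860, 2000


def keep(word: str, year: int) -> bool:
    """Return True if a (word, year) record passes the filters described above."""
    in_time_frame = FIRST_YEAR <= year <= LAST_YEAR
    return in_time_frame and WORD_PATTERN.fullmatch(word) is not None
```

For example, `keep("analytical", 1985)` passes, while `keep("MP3", 1985)`, `keep("Paris", 1985)`, `keep("of", 1985)` and `keep("analytical", 2005)` are all rejected.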

Here is what I found for the average length of words that are at least 3 characters long, for both the US and the UK:

(Graph: average length of words of 3 or more characters in US- and UK-published books, 1860-2000)

I know it’s a stereotype that every generation thinks that the world is getting “dumbed down” and that their generation was more scholarly and erudite. But this graph shows that, for published authors at least, the opposite seems to be true. For most of the 140 years from 1860 to 2000, the average length of words in both US- and UK-published books has been rising. Perhaps more surprisingly, the average word length in US texts was substantially greater than in UK works for basically the entire period from 1860 through the early 1980s.
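For readers wondering what “average length” means in practice, here is one way it could be computed per year. The post does not say whether each word counts once or once per occurrence, so this sketch assumes the average is weighted by how often each word appeared. The function name and the toy counts are made up for illustration.

```python
from collections import defaultdict


def average_length_by_year(records):
    """Occurrence-weighted average word length for each year.

    records: iterable of (word, year, count) tuples that already passed
    the filters (lower-case, letters only, 3+ characters).
    """
    total_letters = defaultdict(int)
    total_words = defaultdict(int)
    for word, year, count in records:
        total_letters[year] += len(word) * count
        total_words[year] += count
    # Skip years with no occurrences to avoid dividing by zero.
    return {
        year: total_letters[year] / total_words[year]
        for year in total_words
        if total_words[year] > 0
    }


# Made-up counts, purely to show the arithmetic.
toy = [("war", 1900, 3), ("peace", 1900, 1), ("epistemology", 1950, 2)]
averages = average_length_by_year(toy)
# 1900: (3*3 + 5*1) / 4 = 3.5
# 1950: (12*2) / 2 = 12.0
```

Weighting by occurrences means a very common short word pulls the average down far more than a rare long word pulls it up. An unweighted average over the distinct words in each year would likely give a different curve.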

What can explain the generally rising trend in both countries for much of the 1900s? One possibility is the growing popularity of abstract concepts. Concrete ideas usually have short words to describe them: fire and war and food and sex and dirt and death and dogs and anger and greed and wine and blood and babies. Words for more abstract concepts, on the other hand, like recapitalization and epistemology and hypernationalism and sustainability and isomerization and neoclassicism and refactoring, tend to be much longer. As economies grow, perhaps authors talk more about abstract concepts and less about concrete ones. (Of course, it is also possible that as economies grow, authors publish more books of fiction and diet tips and self-help books. So, if the trend had been downward rather than upward, I would have had a handy rationalization ready to go, too.)

Personally, I am a bit surprised that times of war and unrest do not seem to be noticeable in the trends. I would have suspected that in times of war, authors would favor shorter, more guttural words like ‘war’ and ‘peace’ and ‘blood’ and ‘death’, while longer, more abstruse words would be used less, but I guess not.

The average word length for US-published books has shown a disturbing reversal since the early 1980s, with the UK passing the US for the first time in the early 1990s. I am not certain what to make of this, as Google’s data makes no effort to account for the popularity of works (so a book that sold a million copies might get scanned once, while one that sold 5 copies might also get scanned once). Also, a study by Jean-Baptiste Michel and Erez Lieberman Aiden noted that the quality of Google’s scanned records might not be very high, as the researchers needed to manually inspect many records to ensure that they were not mischaracterized.

Google’s datasets are chock-full of mysteries like this one. For example, I also looked at the fraction of words in the US- and UK-published works that are decimal numbers at least 2 digits long (like 1.3, which is two digits, or 314.15, which is five digits). Here is the result:

(Graph: fraction of words that are decimal numbers of 2 or more digits in US- and UK-published books)
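The decimal-number fraction can be sketched in Python along the following lines. Two things here are assumptions rather than facts from the post: that a “decimal number” must contain a decimal point with digits on both sides, and that the denominator is every token Google recorded for that year. All names are hypothetical, and the toy counts are made up.

```python
import re
from collections import defaultdict

# Digits, one decimal point, digits: '1.3' and '314.15' match; '13' and '1.' do not.
# Any match automatically has at least 2 digits.
DECIMAL_PATTERN = re.compile(r"[0-9]+\.[0-9]+")


def count_digits(token: str) -> int:
    """Number of digit characters, so '314.15' has five."""
    return sum(ch in "0123456789" for ch in token)


def decimal_fraction_by_year(records):
    """Share of all token occurrences in each year that are decimal numbers.

    records: iterable of (token, year, count) tuples, unfiltered.
    """
    decimal_total = defaultdict(int)
    grand_total = defaultdict(int)
    for token, year, count in records:
        grand_total[year] += count
        if DECIMAL_PATTERN.fullmatch(token):
            decimal_total[year] += count
    return {
        year: decimal_total[year] / grand_total[year]
        for year in grand_total
        if grand_total[year] > 0
    }


# Made-up counts, purely to show the arithmetic.
toy = [("1.3", 1900, 1), ("the", 1900, 9), ("314.15", 1950, 1), ("13", 1950, 3)]
fractions = decimal_fraction_by_year(toy)
# 1900: 1 / 10 = 0.1
# 1950: 1 / 4 = 0.25
```

The choice of denominator matters. If punctuation tokens are included in the total, the fraction will come out smaller than if only letter-based words are counted, so any comparison between the US and UK needs the same choice on both sides.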

Personally, I find these results difficult to believe, given the large discrepancy between the US and UK and the fact that I cannot think of any explanation for it. On the other hand, these are the results I have found, so I shouldn’t dismiss them out of hand simply because they don’t fit my preconceived notions.

So, I would love to dig deeper into this, but the amount of effort required to put together justifiable statements about Google’s data is enormous.

If you have any insight, please let me know in the Comments section below.


  1. akismet-c3ae90403018fdfe963bff4ee8024fda

    Weird. I’m guessing that it has something to do with trends in books for children versus books for adults – that could explain short vs. long words and large numbers of digits.

    You might use that crazy analysis tool you have to check out population demographics for the US and UK over that time period – I would bet there’s some sort of correlation.



