One More New Time-Out

Google has updated the data they provide about the frequency of words in US English text (among other languages). I downloaded the data sets from Google Ngrams.

In a previous post, I noted that the average length of words in US English text has been growing fairly consistently for at least 150 years. In 1860, the average word longer than 3 characters in US English text was 5.2 characters long. From the 1970s onward, it was at least 5.6 characters.
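As a rough sketch of how an average like that could be computed: the snippet below takes a mapping from word to occurrence count and returns the count-weighted mean length of words longer than 3 characters. Weighting by count is an assumption about the method, and the function name and toy counts are made up for illustration; they are not Google's figures.

```python
def average_word_length(counts, min_length=4):
    """Count-weighted mean length of words with at least `min_length` characters.

    `counts` maps each word to how often it was printed (hypothetical input
    shape; the real data would first have to be aggregated into this form).
    """
    total_chars = 0
    total_words = 0
    for word, count in counts.items():
        if len(word) >= min_length:  # "longer than 3 characters"
            total_chars += len(word) * count
            total_words += count
    if total_words == 0:
        raise ValueError("no words meet the length threshold")
    return total_chars / total_words


# Toy counts, invented for the example.
toy_counts = {"the": 50, "word": 10, "length": 5, "sophisticated": 1}
print(average_word_length(toy_counts))
```

Short words such as "the" are skipped entirely, so they affect neither the numerator nor the denominator.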

But length of words is only one way to measure sophistication of text.

Another is the number of words you need to know in order to recognize a given percentage of printed words. From Google’s data I found the most popular words published in the year 2000, and then stripped out so-called “stop words” like the, of, and an (and and, for that matter). (A complete list of the stop words I used appears below.)
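The filtering step can be sketched in a few lines. This is a minimal illustration, not the exact procedure used for the post: it assumes the counts for one year are already in a word-to-count mapping, it folds upper and lower case together (an assumption), and it uses only a small subset of the stop-word list. The counts are invented.

```python
from collections import Counter

# Small subset of the stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "of", "the", "to", "in"}


def rank_non_stop_words(counts, stop_words=STOP_WORDS, top_n=None):
    """Return (word, count) pairs, most frequent first, with stop words removed."""
    merged = Counter()
    for word, count in counts.items():
        merged[word.lower()] += count  # case folding is an assumption
    ranked = [(w, c) for w, c in merged.most_common() if w not in stop_words]
    return ranked if top_n is None else ranked[:top_n]


# Toy counts, invented for the example.
toy_counts = {"the": 100, "The": 20, "of": 60, "one": 9, "One": 2, "more": 8, "new": 7}
print(rank_non_stop_words(toy_counts, top_n=3))
```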

It turns out that the most popular non-stop words in US English in the year 2000 were: one, more, new, time and out. The five words accounted for about 1.7% of all printed words in US English in the year 2000, according to Google’s data.

If you were willing to learn the 100 most-popular non-stop words, you would be able to recognize about 12.5% of all printed words.

However, among texts published in 1970, knowing the 100 most-popular non-stop words means that you would have recognized about 13.2% of words. And in 1900, about 15.4% of words. (We’re not talking about particularly complicated words here either. In 2000, the 100th most-popular word was power. In 1970, second. And in 1900, having.)

Below is a graph of the percentage of US English words you would recognize if you knew the 10 most-popular, 100 most-popular, 1000 most-popular, all the way up to 10000 most-popular words in 1900, 1970 and 2000.
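Here is a minimal sketch of how the numbers behind such a graph could be produced. It assumes one word-to-count mapping per year and divides by the count of all printed words, stop words included, which is one reading of "of all printed words" above. The function name, cutoffs and toy counts are illustrative only.

```python
from itertools import accumulate


def coverage_curve(counts, stop_words, cutoffs):
    """Share of all printed words covered by the top-N non-stop words, for each N."""
    total = sum(counts.values())  # denominator includes stop words (assumption)
    ranked = sorted(
        (c for w, c in counts.items() if w not in stop_words), reverse=True
    )
    running = list(accumulate(ranked))  # running[i] = count of the top i+1 words
    curve = {}
    for n in cutoffs:
        if n <= 0 or not running:
            curve[n] = 0.0
        else:
            # If fewer than n words exist, use everything available.
            curve[n] = running[min(n, len(running)) - 1] / total
    return curve


# Toy counts for a single year, invented for the example.
stop_words = {"the", "of"}
toy_year = {"the": 50, "of": 30, "one": 8, "more": 6, "new": 4, "time": 2}
print(coverage_curve(toy_year, stop_words, cutoffs=(1, 2, 10)))
```

Running the same function on the counts for 1900, 1970 and 2000 with cutoffs of 10, 100, 1000 and 10000 would give one curve per year.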

[Figure: US English cumulative density function: percentage of printed words recognized versus number of most-popular words known, for 1900, 1970 and 2000]

It turns out that, as time goes on, knowing the most-popular words lets you recognize less and less of printed US English.

That is, not only is the average length of our words getting longer, but studying a current list of, say, the 500 most-popular words will get you less and less distance through a random passage of text as time goes on.
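The same idea can be turned around to ask how many of the most-popular non-stop words you would need to know to reach a given share of printed words. The sketch below is an illustration under the same assumptions as before (one word-to-count mapping per year, stop words counted in the total but not in the list you study); the names and toy counts are invented.

```python
def words_needed(counts, stop_words, target_share):
    """Smallest N such that the top-N non-stop words reach `target_share`
    of all printed words, or None if even the full list falls short."""
    total = sum(counts.values())  # denominator includes stop words (assumption)
    ranked = sorted(
        (c for w, c in counts.items() if w not in stop_words), reverse=True
    )
    covered = 0
    for n, count in enumerate(ranked, start=1):
        covered += count
        if covered >= target_share * total:
            return n
    return None


# Toy counts for a single year, invented for the example.
stop_words = {"the", "of"}
toy_year = {"the": 50, "of": 30, "one": 8, "more": 6, "new": 4, "time": 2}
print(words_needed(toy_year, stop_words, 0.10))
print(words_needed(toy_year, stop_words, 0.50))
```

If the trend described above holds, this number would grow from 1900 to 1970 to 2000 for any fixed target share.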

(Again, these are not particularly complicated words. For the year 2000, the 10,000th most-popular word was wagons. For 1970, sailor.)

My point is: printed US English seems to be getting more sophisticated over time, not less so. If you’re a book editor who is planning to travel through time, it would be easier to go back to 1900 than for someone from 1900 to come forward to our time.

“Stop Words” for this exercise:

a, able, about, across, after, all, almost, also, am, among, an, and, any, are, as, at, be, because, been, but, by, can, cannot, could, dear, did, do, does, either, else, ever, every, for, from, get, got, had, has, have, he, her, hers, him, his, how, however, i, if, in, into, is, it, its, just, least, let, like, likely, may, me, might, most, must, my, neither, no, nor, not, of, off, often, on, only, or, other, our, own, rather, said, say, says, she, should, since, so, some, than, that, the, their, them, then, there, these, they, this, tis, to, too, twas, us, wants, was, we, were, what, when, where, which, while, who, whom, why, will, with, would, yet, you, your
