How many words do you actually use?
March 11, 2007
Posted by caveblogem in Blogs and Blogging, linguistics, Other, statistical analysis, vocabulary, writing.
A while back I was reading Litlove’s blog, Tales from the Reading Room, and came across the following snippet:
I was idly looking around the internet to try to find out the average size of an English speaker’s vocabulary, but it turns out the very complexity of English makes it difficult to gauge. Estimates of a college graduate’s vocabulary range from 20-25,000 to the supposedly more accurate 60,000 active words and 75,000 passive words.
This interested me, because as a part-time researcher I immediately saw how incredibly difficult such an experiment would be. Every word that came out of somebody’s mouth would have to be checked against a list of words already uttered. So I suggested that somebody undertake an experiment whereby blog posts could be examined to unearth the number of different words one actually uses while blogging. I had suggested that a macro could be written in MS Word that would yield the necessary data, but have found that, as in most areas of life, Word is needlessly complex, and not up to the task anyway.
So with some additional work I managed to figure out how to count the number of different words I use on my blog. I took a sample of about 15,000 words and stripped out all of the numbers, emoticons, and other junk that aren’t words (although “aren’t” shows up as one word using my method, as does any hyphenated compound, like Anti-Christ). I also stripped out any proper nouns, as well as misspellings, which was the time-consuming part of this little experiment. I’m surprised how many words I misspell.
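For anyone who wants to try this at home, here is a minimal Python sketch of that kind of count. It is only an illustration, not the procedure used for the numbers in this post. The regular expression, the lowercasing, and the names `tokenize`, `count_unique`, and `exclude` are all assumptions. Proper nouns and misspellings still have to be found by hand and passed in through `exclude`.

```python
import re

# A "word" is a run of letters that may contain internal apostrophes or
# hyphens, so "aren't" and "Anti-Christ" each count as one word.
# (Assumption: this pattern is illustrative, not the post's actual rule.)
WORD_RE = re.compile(r"[A-Za-z]+(?:['’-][A-Za-z]+)*")

def tokenize(text):
    """Return the lowercased words in text, skipping numbers and symbols."""
    return [w.lower() for w in WORD_RE.findall(text)]

def count_unique(text, exclude=()):
    """Count distinct words, ignoring anything listed in exclude."""
    skip = {w.lower() for w in exclude}
    return len({w for w in tokenize(text) if w not in skip})

sample = "I aren't sure the Anti-Christ blogs :-) but 42 cows aren't kine."
print(tokenize(sample))
print(count_unique(sample, exclude=["Anti-Christ"]))
```

Lowercasing means “The” and “the” count once, which seems like the sensible choice for a vocabulary count, but it is a choice.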
But here’s what I found: Out of 13,000 words (what was left from the sample after getting rid of names, misspelled words, etc., and then making a round number out of it) I used 2,734 unique words. That sounds pretty pathetic when you think about the numbers bandied about in the snippet quoted above, doesn’t it? But it is a pretty small sample, 13,000. On the other hand, it took a bit of time, even with 13,000 words. So I decided to turn it into a bunch of progressively smaller samples and use the data to extrapolate a bit.
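One way to get those smaller samples, sketched in Python: walk through the text once and record the running count of unique words at regular intervals. This assumes each smaller sample is simply the first N words of the big one, which may not be how the samples here were actually cut. The function name `vocabulary_growth` and the `step` parameter are made up for the illustration.

```python
import re

# Same illustrative word pattern as a simple unique-word counter would use.
WORD_RE = re.compile(r"[A-Za-z]+(?:['’-][A-Za-z]+)*")

def vocabulary_growth(text, step):
    """Return (words_read, unique_words_so_far) pairs every `step` words."""
    words = [w.lower() for w in WORD_RE.findall(text)]
    seen = set()
    points = []
    for i, word in enumerate(words, start=1):
        seen.add(word)
        # Record a point at each multiple of step, plus one at the very end.
        if i % step == 0 or i == len(words):
            points.append((i, len(seen)))
    return points

text = "the cat sat on the mat and the dog sat on the cat"
print(vocabulary_growth(text, step=4))
# prints [(4, 4), (8, 6), (12, 7), (13, 7)]
```

Each pair is one point on the curve: sample size on one axis, unique words on the other.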
The data formed what looked to be a quadratic equation:
Here’s the actual data plotted, with an extrapolation that tops out at the humbling figure of 3,127 (the red marks are experimental data, the blue ones plot the quadratic above):
Again, not very impressive. And it is just a statistical approximation, of course. The real number wouldn’t actually have such a maximum; it would slowly increase toward an asymptote, and then jump right over it while you are arguing over a Scrabble game or a crossword puzzle, or playing Trivial Pursuit.
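The fit-and-extrapolate step can be sketched in plain Python as a least-squares quadratic followed by a look at where the parabola peaks. The points below are toy numbers generated from y = -x² + 10x + 5, not the data from this experiment, and the function names `fit_quadratic` and `peak` are hypothetical. I am not claiming this is the exact fitting procedure behind the 3,127 figure.

```python
def fit_quadratic(points):
    """Least-squares fit of y = a*x**2 + b*x + c; returns (a, b, c)."""
    # Sums of powers of x, and of y times powers of x, for the normal equations.
    s = [sum(x ** k for x, _ in points) for k in range(5)]
    t = [sum(y * x ** k for x, y in points) for k in range(3)]
    m = [
        [s[4], s[3], s[2], t[2]],
        [s[3], s[2], s[1], t[1]],
        [s[2], s[1], s[0], t[0]],
    ]
    # Gauss-Jordan elimination with partial pivoting on the 3x3 system.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(3):
            if row != col:
                factor = m[row][col] / m[col][col]
                m[row] = [v - factor * p for v, p in zip(m[row], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))

def peak(a, b, c):
    """Vertex of the parabola: (x at the peak, y at the peak)."""
    x = -b / (2 * a)
    return x, c - b * b / (4 * a)

# Toy points only (assumption): they lie exactly on y = -x**2 + 10*x + 5.
toy_points = [(1, 14), (2, 21), (3, 26), (4, 29), (6, 29)]
a, b, c = fit_quadratic(toy_points)
peak_x, peak_y = peak(a, b, c)
print(round(peak_x, 3), round(peak_y, 3))
```

With real word counts it probably helps to measure the sample size in thousands of words rather than single words, so the sums of fourth powers stay manageable. And, as noted above, the peak is an artifact of the curve chosen, not a real ceiling.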
There are all sorts of words that I know, but that I have only used once, when taking the Graduate Record Exams. And there are words that I find myself using very rarely, of course. I was listening to “Car Talk” yesterday coming back from the tennis courts and one of the two, Click or Clack, I forget which, used the word “kine,” an archaic plural of the word “cow.” The previous sentence, I think, is the only time I’ve ever used that word. But if my little experiment shows anything, it shows that it would take me a lot of blogging to reach the 60,000 active words that Litlove found as an estimate for a college graduate’s vocabulary.
Over the next few days I will be checking to see how this figure, this humbling 3,127, compares to other blogs out there, if only to make myself feel better. I’ll entertain suggestions as to which blogs, of course.