Which words do you own?–Alabaster Crippens March 17, 2007

Posted by caveblogem in linguistics, Other, tagging, vocabulary, writing.
5 comments

I keep expecting that the next blog I add will not contribute any new words to this study, and that I will have to devise some other way to look at the data.  I have thought about this a great deal while I excise non-words and proper names from the samples.  One way would be to create a word cloud out of words that a blog uses much more frequently than other blogs do (so I’m not without a contingency plan). 
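
That contingency plan could be sketched roughly as follows. This is a minimal illustration, not the method used in these posts: the function name, the add-one smoothing, and the `min_ratio` threshold are all assumptions made for the example.

```python
from collections import Counter

def distinctive_words(blog_words, other_words, min_ratio=3.0):
    """Return {word: ratio} for words this blog uses much more often than others.

    ratio = (relative frequency in blog_words) /
            (add-one smoothed relative frequency in other_words)
    The smoothing keeps the ratio finite for words the other blogs never use.
    """
    blog_counts = Counter(blog_words)
    other_counts = Counter(other_words)
    vocab_size = len(set(blog_counts) | set(other_counts))
    result = {}
    for word, count in blog_counts.items():
        blog_rate = count / len(blog_words)
        other_rate = (other_counts[word] + 1) / (len(other_words) + vocab_size)
        ratio = blog_rate / other_rate
        if ratio >= min_ratio:
            result[word] = ratio
    return result
```

The ratios could then drive font sizes in a cloud, in place of raw counts.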

But there is no need to do this quite yet.  Mr. Alabaster Crippens added 839 fresh words to the pile of data that I continue to amass.  His most frequent addition was alabaster, which is not just a name, of course. 

accpy.jpg

I’ll resist the urge to psychoanalyze Mr. Crippens based on the next most popular group of words, but it is tempting to try.  I find it interesting that words that don’t seem all that odd continue to pop up in these samples.  I use the word “idiot” six or eight times a day.  Strange that it didn’t pop up until I added Mr. Crippens’ sample.  I’ve collected more than 100,000 words now, and 11,900 distinct words reside in the database.

And I have another volunteer for tomorrow or the next day, depending on when I get around to it.  I have never read EelKat’s blog, but it looks to add some fresh words to the pile as well. 

Which words do you own?–Raincoaster March 17, 2007

Posted by caveblogem in 3QD, Blogs and Blogging, daily kos, dailykos, linguistics, Neil Gaiman, Other, Three Quarks Daily, vocabulary, writing.
5 comments

I first saw raincoaster’s blog in WordPress’s list of fast-growing blogs or popular blogs or something like that.  I check in once in a while and am always entertained.  We share a passion for H. P. Lovecraft and squids and a couple of other things. 

I took a sample of her blog posts yesterday and processed them, and I was a little bummed out at first about the fact that the largest words in the cloud are tag words, and so they present little new information.  Many of the others have to do with current events, so they don’t seem like the timeless blogger fingerprints I had envisioned a day or two ago. 

Nevertheless, if you look at some of the smaller words that pop out, distinguishing her vocabulary from those of Three Quarks Daily, Daily Kos, Pretty Good on Paper, and Neil Gaiman’s Journal, you’ll see some interesting nuggets (below, click to enlarge).

vocabcloud-rc.jpg

On a side note about Neil Gaiman, Dan somebody (whose relationship to Mr. Gaiman I did not quite get) has done some interesting analysis of Mr. Gaiman’s blog over a long span of time.  The links, if you should wish to pursue them, are in the comment thread to the post on Mr. Gaiman here.  The analysis and method Dan used seem more sophisticated than mine.  For example, he passes Mr. Gaiman’s words through something called a “Yahoo Term Extraction API.”  If I remember my Latin roots correctly it seems to have something to do with bees.  At any rate, his analysis also takes a slightly different tack, examining “terms,” rather than words, and chopping off words that have fewer than five letters.  So what you will see are dynamic clouds of what I suspect are topical concerns of Mr. Gaiman, rather than the individual word usage.  They are fascinating to look at, however, and quite clever.

The day before yesterday I stood for several minutes looking at a book on Java at my local Barnes and Noble.  If it hadn’t had water damage, which made that funny noise and wouldn’t let me flip pages easily, I would have actually bought it.  If anyone is interested in saving me the time and trouble it would take to learn some sort of dynamic HTML, I would happily partner up, supplying data and analysis for the creation of some sort of acceptable widget. 

Up next, Alabaster Crippens.

A short digression March 17, 2007

Posted by caveblogem in Blogs and Blogging, linguistics, literature, Other, vocabulary.

I always put posts up without doing significant research into what other people are doing on whatever subjects I happen to be writing about.  Luckily, this makes me look careless and ignorant, instead of lazy and self-absorbed, which is probably closer to the truth. 

Anyway, this morning I got a comment somewhere in this vocabulary thread from kuipercliff, who pointed me to two very interesting sites that perform and display research along similar lines.

WordCount™ is an artistic experiment in the way we use language. It presents the 86,800 most frequently used English words, ranked in order of commonness. Each word is scaled to reflect its frequency relative to the words that precede and follow it, giving a visual barometer of relevance. The larger the word, the more we use it. The smaller the word, the more uncommon it is.

The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English, both spoken and written.

The BNC database is proprietary, and English, too, so I’ll keep using my own, for now.  But both of these sites are pretty interesting, as is Kuipercliff’s blog.

Which words do you own?–Neil Gaiman March 16, 2007

Posted by caveblogem in Blogs and Blogging, bookmooch, Books, Cartooning, fiction, literature, Neil Gaiman, Other, vocabulary, web 2.0, writing.
7 comments

[Note: This is part of a continuing series on the actual vocabulary in use in the blogosphere.  Posts on this subject started here.] 

I began to read the work of Neil Gaiman last year when somebody suggested I read Good Omens, a collaboration between Mr. Gaiman and Terry Pratchett.  Then I read American Gods and Neverwhere and everything else I could get my hands on.  The only thing I haven’t been able to get ahold of is his latest, Fragile Things, which nobody has posted on Bookmooch or Paperbackswap (have to be a little frugal this year, I’m afraid.)  Anyway, Mr. Gaiman is a tremendously talented writer of creepy and interesting tales.  And he writes a darn good blog, too, which I subscribe to and read whenever I can.

I sampled 22,000 words from Mr. Gaiman’s site, spanning the period January 6 – March 14, yesterday morning.  I had to run the spell-check a little differently from the way I normally do, because Mr. Gaiman uses the English spellings of words like color, organize, check (cheque, a draft on one’s checking account), favorite, and orangutan.  So I just changed these to the Americanized versions in his list so that I could merge it in with the others.
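
The spelling merge described above might look something like this. It is only a sketch: the mapping table is a hypothetical stand-in covering four of the words mentioned (the British form of "orangutan" that turned up is not given in the post, so it is left out), and `americanize` is an invented name.

```python
# Hypothetical lookup table; a real one would need many more entries.
BRITISH_TO_AMERICAN = {
    "colour": "color",
    "organise": "organize",
    "cheque": "check",
    "favourite": "favorite",
}

def americanize(words, mapping=BRITISH_TO_AMERICAN):
    """Replace each British spelling with its American form; leave other words alone."""
    return [mapping.get(word, word) for word in words]
```

Doing this before the merge means "colour" and "color" count as one word rather than two.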

I have started to add some words to my spell-checker, and with Mr. Gaiman’s blog I added googled, blog, blogger, blogging, edamame, and perhaps a couple of others that I forgot to write down at the time but which I was absolutely certain were correctly spelled words.

The Blogger’s Vocabulary List is getting larger with each blog I incorporate.  The latest, which includes samples from Three Quarks Daily, Daily Kos, this blog (Pretty Good on Paper) and Neil Gaiman’s Journal, contains 9,383 different words.  In a couple of months I should be able to make a pretty good estimate of the size of the vocabulary in actual use out there (here?) in the blogosphere.  Check this space for updates.

Mr. Gaiman added 1,112 words to the list, an impressive feat at this point for an individual blogger.  Here is a vocabulary cloud composed of the words Mr. Gaiman added to the list, with each word’s font size in points set to twice the number of times it appeared in his 20,000-word sample (click for a larger image).

cloudng.jpg
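
Counting the words a new blog adds to the list comes down to a set difference. The sketch below assumes the master list is a plain set of words and the new sample is a list of already-cleaned words; `add_blog` is a made-up name, and the posts themselves describe doing this in a database.

```python
from collections import Counter

def add_blog(master_vocabulary, blog_words):
    """Return (added, merged).

    added  -- {word: count} for words in blog_words not yet in master_vocabulary
    merged -- the master vocabulary after the new blog is folded in
    """
    counts = Counter(blog_words)
    added = {word: n for word, n in counts.items() if word not in master_vocabulary}
    merged = set(master_vocabulary) | set(counts)
    return added, merged
```

Here `len(added)` is the number of fresh words a blog contributes and `len(merged)` is the new size of the list.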

I’ve decided to stop estimating the size of the vocabularies of individual blogs in this study because such estimates are too artificial.  Even bloggers and writers use most of their words in conversation.  And since your vocabulary is altered by each conversational partner, (your conversational partner asks a question about broccoli or oysters and you find yourself using these words yourself, if only to ask for clarification) estimates of this sort don’t seem all that relevant.

What does Mr. Gaiman’s vocabulary cloud say about him as a blogger?  What does it say about the bloggers to whom his words were compared?  What will Raincoaster’s vocabulary cloud say about her or us or anything, when it is added to this growing pool tomorrow? 

Anybody?
Anybody? 
Anybody?
Bueller?

Digital Doodlebrains March 15, 2007

Posted by caveblogem in Blogs and Blogging, Cartooning, Other.

Drawn just posted a link to a site I hadn’t seen before–Digital Doodle, which puts up a theme and lets you draw a picture, then posts it for votes.  I only got 3 hours of sleep last night, so although I had just read about the site, and that pictures were supposed to be based on a particular theme, and that the theme today was “Jungle,” I drew a squid.  Does the word “Jungle” encompass underwater scenes?

Much to my surprise, it turns out that my drawing was more in keeping with the actual (as opposed to official) theme of the day, male genitalia.  This was accidental on my part, though.

Click picture below to enlarge.  Better yet, just don’t bother.

waryhumboldt.jpg

A Vocabulary Cloud March 14, 2007

Posted by caveblogem in 3QD, Blogs and Blogging, daily kos, dailykos, linguistics, literature, Other, politics, tagging, Three Quarks Daily, vocabulary, web 2.0.
3 comments

I collected a bigger sample of words from my own blog this morning (I did not include the recent posts on vocabulary stuff, because I might have influenced the number of unique words, as MoonTopples and strugglingwriter both pointed out.  Instead, the sample is from posts January 4 – March 9, 2007, totalling 20,000 words.  See the first post in this series if you don’t know what I am talking about.) 

Then I put them in a database with the samples from Three Quarks Daily and Daily Kos and just pulled out words that were unique to my site (words which did not appear in 3QD or Daily Kos samples at all).   Then I sorted these by the number of times they appeared on my site, assigning a font twice as large as the number of occurrences on my blog (so that words appearing three times are in 6 point Verdana–I didn’t include words unique to my blog that appeared fewer than three times, because there were more than a thousand).  I then sorted them again alphabetically and the result is the vocabulary cloud below (click for a larger image). 

vocabcloud-pgp.jpg
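
The recipe above (keep words unique to one blog, drop those used fewer than three times, size each at twice its count, sort alphabetically) can be sketched in a few lines. The function and parameter names are invented for illustration, and the inputs are assumed to be lists of already-cleaned words.

```python
from collections import Counter

def vocabulary_cloud(own_words, other_words, min_count=3, points_per_use=2):
    """Return an alphabetical list of (word, font size in points).

    Only words absent from other_words are kept, and only if they occur
    at least min_count times in own_words.
    """
    other_vocabulary = set(other_words)
    counts = Counter(word for word in own_words if word not in other_vocabulary)
    entries = [(word, n * points_per_use)
               for word, n in counts.items() if n >= min_count]
    return sorted(entries)
```

A word that appears three times comes out at 6 points, matching the example in the text.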

It is, in some ways, the opposite of the tag clouds you see on Technorati, because so many of those are made up of proper names, which have been excluded from the samples I took. 

It’s like a blogger’s fingerprint.

How many words do you actually use?–DAILY KOS March 13, 2007

Posted by caveblogem in Blogs and Blogging, daily kos, dailykos, linguistics, Other, politics, vocabulary, web 2.0.
2 comments

I may turn this examination of blogging vocabularies, which I began in this post, into a weekly series.  Although I am automating more steps as I perfect the technique, it is still a little time-consuming to do this every day. 

Today I examine the vocabulary of one of the foremost political blogs in the United States–Daily Kos.  (I should probably note that while this was the first blog that came to mind when I decided to look at the vocabularies of some of the “A-List” blogs, it is not one that I actually read.)

In the first 13,000 words of the sample, Daily Kos used 3,053 different words, which compares favorably to my total from the other day of only 2,734.  The whole sample included 20,000 words, taken from posts beginning with Tue Mar 13, 2007 at 11:29:01 AM PDT and ending four pages later.  The chart below shows data from the whole sample in 500-word increments. 

dkestchart.jpg

(The extrapolation in the above chart uses the equation below.)

dkequation.jpg

The estimated total vocabulary of this blog is therefore 4,152 unique words.
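
The data points behind a chart like the one above are running counts of distinct words, taken every 500 words. A minimal sketch of that tally, assuming the sample is a list of cleaned words in reading order (`vocabulary_growth` is an invented name):

```python
def vocabulary_growth(words, step=500):
    """Return [(words read so far, distinct words so far), ...] every `step` words."""
    seen = set()
    points = []
    for position, word in enumerate(words, start=1):
        seen.add(word)
        if position % step == 0:
            points.append((position, len(seen)))
    return points
```

Any words left over after the last full increment are counted in `seen` but do not produce a data point.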

How many words do you actually use?–Three Quarks Daily March 12, 2007

Posted by caveblogem in 3QD, Blogs and Blogging, linguistics, Other, statistical analysis, Three Quarks Daily, vocabulary, web 2.0, writing.
6 comments

I said in my last post that I was going to look at the vocabularies of some other blogs to make me feel better.  And I did, but I left the analysis I did yesterday of the foremost political blog in the US at home, so I can’t post it today (and the answer is “yes,” it did make me feel better). 

I can post another blog’s results, however.  Three Quarks Daily is the first blog I ever read.  That’s not quite right.  3QD is the first thing I ever read after somebody showed it to me and said “that is a blog.”  Anyway, 3QD is still one of my favorites.  It is like getting the best of Science, The New Yorker, The Economist and several other top-notch magazines delivered daily with interesting commentary. 

It is written by at least 4 or 5 people per week and by as many as maybe 12.  So, if these people have different vocabularies, the total number of words they might use in a sample could be expected to exceed mine by a good margin.  And it did. 

In a 13,000-word sample covering March 4 – March 12, 3QD used 3,654 different words (my total, for a comparison, was 2,734 words, so they used about a third more than me).  I actually drew a much larger sample from their site than I did from mine, partially because I wanted to be comprehensive, fair, and careful.  In 18,000 words 3QD used 4,576 different words. 

And a little regression produced the following quadratic equation [The “Pearson’s R” was actually a little better when using a cubic model, but that would imply an infinite vocabulary, and as much as I respect the team over there at 3QD I feel like I have to draw the line somewhere]:

3qdequation2.jpg

Which, as you can see in the chart below (click to enlarge–actual data is in red, estimates are in blue) tops out around a total vocabulary in use of 5,698. 

3qdchart.jpg
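
For anyone who wants to try this kind of extrapolation, here is a generic least-squares sketch. It is not the equation pictured above, whose coefficients are not reproduced in the text; it only shows how a quadratic y = a*x^2 + b*x + c can be fitted to (sample size, distinct words) points and where such a curve tops out, namely at x = -b/(2a) when a is negative. All names are illustrative.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = a*x**2 + b*x + c via the normal equations."""
    s = [sum(x ** k for x in xs) for k in range(5)]                 # sums of x^k
    t = [sum((x ** k) * y for x, y in zip(xs, ys)) for k in range(3)]  # sums of x^k * y
    # Augmented 3x3 system for the unknowns (a, b, c).
    m = [
        [s[4], s[3], s[2], t[2]],
        [s[3], s[2], s[1], t[1]],
        [s[2], s[1], s[0], t[0]],
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    a, b, c = (m[i][3] / m[i][i] for i in range(3))
    return a, b, c

def peak(a, b, c):
    """Return (x, y) at the vertex of the parabola; a maximum when a < 0."""
    x = -b / (2 * a)
    return x, a * x * x + b * x + c
```

Measuring x in thousands of words rather than single words keeps the sums of fourth powers small and the arithmetic better behaved.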

It occurred to me this morning that it might be interesting to see how many words are shared by different blogs, as a sort of Venn diagram, perhaps.  Do bloggers speak different languages?  How much vocabulary is shared by different blogs?  Which words do they share?  Which are different?  I must know!  So I’ll be taking a look at that question after I do a couple more of these. 
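
The two-blog version of that Venn diagram is just three set operations. A small sketch, with made-up names, assuming each blog's vocabulary is available as a collection of words:

```python
def venn_regions(vocabulary_a, vocabulary_b):
    """Split two vocabularies into the three regions of a two-circle Venn diagram."""
    a, b = set(vocabulary_a), set(vocabulary_b)
    return {
        "only_a": a - b,   # words only the first blog uses
        "shared": a & b,   # words both blogs use
        "only_b": b - a,   # words only the second blog uses
    }
```

The sizes of the three sets answer "how much is shared," and the sets themselves answer "which words."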

How many words do you actually use? March 11, 2007

Posted by caveblogem in Blogs and Blogging, linguistics, Other, statistical analysis, vocabulary, writing.
49 comments

A while back I was reading Litlove’s blog, Tales from the Reading Room and came across the following snippet:

I was idly looking around the internet to try to find out the average size of an English speaker’s vocabulary, but it turns out the very complexity of English makes it difficult to gauge. Estimates of a college graduate’s vocabulary range from 20-25,000 to the supposedly more accurate 60,000 active words and 75,000 passive words.

This interested me, because as a part-time researcher I immediately saw how incredibly difficult such an experiment would be.  Every word that came out of somebody’s mouth would have to be checked against a list of words already uttered.  So I suggested that somebody undertake an experiment whereby blog posts could be examined to unearth the number of different words one actually uses while blogging.  I had suggested that a macro could be written in MS Word that would yield the necessary data, but have found that, as in most areas of life, Word is needlessly complex, and not up to the task anyway. 

So with some additional work I managed to figure out how to count the number of different words I use on my blog.  I took a sample of about 15,000 words, stripped out all of the numbers, emoticons, and other junk that aren’t words (although “aren’t” would show up as one word, using my method, as does any word that is a hyphenated compound, like Anti-Christ).  I also stripped out any proper nouns, as well as misspellings, which was the time-consuming part of this little experiment.  I’m surprised how many words I misspell.
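
The mechanical part of that cleanup can be approximated in a few lines. This sketch is an assumption-laden stand-in for the process described above, not a record of it: it keeps contractions and hyphenated compounds as single words and drops numbers and emoticons, but it does nothing about proper nouns or misspellings, which still need a spell-checker and a human.

```python
import re

# One or more letters, optionally joined by apostrophes or hyphens.
WORD = re.compile(r"[a-z]+(?:['’-][a-z]+)*")

def extract_words(text):
    """Lowercase the text and return the tokens that look like words."""
    words = []
    for token in text.lower().split():
        token = token.strip(".,;:!?\"'“”‘’()[]")  # trim surrounding punctuation
        if WORD.fullmatch(token):
            words.append(token)
    return words

def count_distinct(text):
    """Number of different words in the text."""
    return len(set(extract_words(text)))
```

Because tokens are lowercased, "The" and "the" count as the same word.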

But here’s what I found:  Out of 13,000 words (what was left from the sample after getting rid of names, misspelled words, etc., and then making a round number out of it) I used 2,734 unique words.  That sounds pretty pathetic when you think about the numbers bandied about in the snippet quoted above, doesn’t it?  But it is a pretty small sample, 13,000.  On the other hand, it took a bit of time, even with 13,000 words.  So I decided to turn it into a bunch of progressively smaller samples and use the data to extrapolate a bit. 

The data formed what looked to be a quadratic equation:

vocabeq.jpg

Here’s the actual data plotted, with an extrapolation that tops out at the humbling figure of 3,127 (click picture to enlarge.  The red marks are experimental data, the blue ones plot the quadratic above):

pgop-curve.jpg

Again, not very impressive.  And it is just a statistical approximation, of course.  The real number wouldn’t actually have such a maximum; it would slowly increase toward an asymptote, and then jump right over it while you are arguing over a Scrabble game or a crossword puzzle, or playing Trivial Pursuit. 

There are all sorts of words that I know, but that I have only used once–when taking the Graduate Record Exams.  And there are words that I find myself using very rarely, of course.  I was listening to “Car Talk” yesterday coming back from the tennis courts and one of the two, Click or Clack, I forget which, used the word “kine,” an archaic plural of the word “cow.”  The previous sentence, I think, is the only time I’ve ever used that word.  But if my little experiment shows anything, it shows that it would take me a lot of blogging to reach the 60,000 words that Litlove found as an estimate for an English speaker’s active vocabulary.

Over the next few days I will be checking to see how this figure, this humbling 3,127, compares to other blogs out there, if only to make myself feel better.  I’ll entertain suggestions as to which blogs, of course.

I’m still here March 9, 2007

Posted by caveblogem in bookmooch, Books, Education, history, literature, narrative, Philosophy.

I’ve been really busy lately.  Our campus has been searching for a new Chancellor, which is what we call our chief executive here.  What with the public meetings, newspaper articles (for one of the top candidates is the Congressman of the Massachusetts Fifth District, the Honorable Martin Meehan, gaining us national attention), and attendant gossip and what-if talk, it is awfully hard to get things done and also keep up with my new, and more demanding, position. 

On a distantly related subject (trust me on this, for now), it occurred to me the other day that I had been unfair in the past to someone for whom I am building much more respect and admiration these days.  That person is the new President of Harvard University, Drew Gilpin Faust.  Back in graduate school I had to read her book Southern Stories: Slaveholders in Peace and War, and my review of this book was . . . ungenerous. 

Southern Stories is not only excellent scholarship, it is also good writing and has some interesting things to say about how narrative shapes worldview.  My objection to the book at the time was twofold, I now realize. 

  1. It is about Slaveholders in the Antebellum South (and during the war, too, of course).  Let’s face it, people, I should have studied philosophy.  I would have, too, if there had been a well-funded Ph.D. program at the university where I ended up.  Mostly I didn’t care about history and still don’t.  There are times when it is relevant, deeply relevant and important.  Mostly, though, you can get by without it, I think.
  2. Dr. Faust is one of those scholars who don’t say things that are overtly controversial.  For ADD-related reasons, I found her book difficult to handle.  My usual tactic with reading books that didn’t hold my interest was to attempt to disprove, or at least seriously undermine the author’s main thesis.  This usually didn’t sway the opinion of the professor running the class, mind you.  But that wasn’t the point.  It did accomplish its main goal–proving that I had read and understood the book and that I took it seriously.  This book is a collection of essays, which made it even more difficult to overturn. 

So, let me say, Dr. Faust, I am sorry about what I wrote.  The sheer amount of underlining in my copy (which you, gentle reader, may have, if you request it from my bookmooch or paperbackswap account, for I am done with it now) demonstrates that I found much of interest, but few fat targets.  I think that your diplomatic and reasoned approach to Antebellum scholarship and culture will make you an excellent administrator for America’s oldest University.

For the rest of you, I will make a concerted effort to read your blogs this weekend.  I have been adding subscriptions this week to my bloglines account, because I am losing track of all of you with blogspot addresses, unwittingly dropping discussions on comment threads and all of that.