Which words do you own?–Raincoaster March 17, 2007

I first saw raincoaster’s blog in WordPress’s list of fast growing blogs or popular blogs or something like that.  I check in once in a while and am always entertained.  We share a passion for H. P. Lovecraft and squids and a couple of other things.

I took a sample of her blog posts yesterday and processed them, and I was a little bummed out at first about the fact that the largest words in the cloud are tag words, and so they present little new information.  Many of the others have to do with current events, so they don’t seem like the timeless blogger fingerprints I had envisioned a day or two ago. 

Nevertheless, if you look at some of the smaller words that pop out, distinguishing her vocabulary from those of Three Quarks Daily, Daily Kos, Pretty Good on Paper, and Neil Gaiman’s Journal, you’ll see some interesting nuggets (below, click to enlarge).


On a side note about Neil Gaiman, Dan somebody (whose relationship to Mr. Gaiman I did not quite get) has done some interesting analysis of Mr. Gaiman’s blog over a long span of time.  The links, if you should wish to pursue them, are in the comment thread to the post on Mr. Gaiman here.  The analysis and method Dan used seem more sophisticated than mine.  For example, he passes Mr. Gaiman’s words through something called a “Yahoo Term Extraction API.”  If I remember my Latin roots correctly it seems to have something to do with bees.  At any rate, his analysis also takes a slightly different tack, examining “terms,” rather than words, and chopping off words that have fewer than five letters.  So what you will see are dynamic clouds of what I suspect are topical concerns of Mr. Gaiman, rather than his individual word usage.  They are fascinating to look at, however, and quite clever.

The day before yesterday I stood for several minutes looking at a book on Java at my local Barnes and Noble.  If it hadn’t had water damage, which made the pages make that funny noise and wouldn’t let me flip them easily, I would have actually bought it.  If anyone is interested in saving me the time and trouble it would take to learn some sort of dynamic HTML, I would happily partner up, supplying data and analysis for the creation of some sort of acceptable widget.

Up next, Alabaster Crippens.

A Vocabulary Cloud March 14, 2007

I collected a bigger sample of words from my own blog this morning (I did not include the recent posts on vocabulary stuff, because I might have influenced the number of unique words, as MoonTopples and strugglingwriter both pointed out.  Instead, the sample is from posts January 4 – March 9, 2007, totalling 20,000 words.  See the first post in this series if you don’t know what I am talking about.) 

Then I put them in a database with the samples from Three Quarks Daily and Daily Kos and pulled out just the words that were unique to my site (words which did not appear in the 3QD or Daily Kos samples at all).  Then I sorted these by the number of times they appeared on my site, assigning each word a font size, in points, twice the number of its occurrences on my blog (so that words appearing three times are in 6 point Verdana).  I didn’t include words unique to my blog that appeared fewer than three times, because there were more than a thousand of them.  I then sorted them again alphabetically, and the result is the vocabulary cloud below (click for a larger image).


It is, in some ways, the opposite of the tag clouds you see on Technorati, because so many of those are made up of proper names, which have been excluded from the samples I took.

It’s like a blogger’s fingerprint.
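
For anyone who would rather script this than push it through a database, here is a rough Python sketch of the steps described in this post.  It is not the process actually used to make the cloud.  The function names, the simple tokenizer, and the sample text are all made up for illustration, and the only rules taken from the post are "unique to my blog," "at least three occurrences," "font size is twice the count," and "sorted alphabetically."

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out simple words (an assumed, naive tokenizer)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def vocabulary_cloud(own_text, other_texts, min_count=3, points_per_use=2):
    """Return (word, font_size) pairs for words found only in own_text.

    A word is kept if it never appears in any of other_texts and appears
    at least min_count times in own_text.  Its font size in points is
    points_per_use times its count.  The result is sorted alphabetically.
    """
    own_counts = Counter(tokenize(own_text))

    # Every word used by any of the comparison blogs.
    others = set()
    for text in other_texts:
        others.update(tokenize(text))

    cloud = {
        word: count * points_per_use
        for word, count in own_counts.items()
        if word not in others and count >= min_count
    }
    return sorted(cloud.items())


if __name__ == "__main__":
    # Tiny invented samples, just to show the shape of the output.
    mine = "squid squid squid tentacle tentacle politics politics politics"
    theirs = ["politics and more politics"]
    for word, size in vocabulary_cloud(mine, theirs):
        print(f"{word}: {size}pt")
```

Raising `min_count` shrinks the cloud, which is the same trade-off as dropping the thousand-plus words that appeared fewer than three times.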

How many words do you actually use?–DAILY KOS March 13, 2007

I may turn this examination of blogging vocabularies, which I began in this post, into a weekly series.  Although I am automating more steps as I perfect the technique, it is still a little time-consuming to do this every day. 

Today I examine the vocabulary of one of the foremost political blogs in the United States–Daily Kos.  (I should probably note that while this was the first blog that came to mind when I decided to look at the vocabularies of some of the “A-List” blogs, it is not one that I actually read.)

In the first 13,000 words of the sample, Daily Kos used 3,053 unique words, which compares favorably to my total from the other day of only 2,734.  The chart below shows data from the whole sample, plotted in 500-word increments.  The sample included 20,000 words taken from posts beginning with Tue Mar 13, 2007 at 11:29:01 AM PDT and ending four pages later.


(The extrapolation in the above chart uses the equation below.)


The estimated total vocabulary of this blog is therefore 4,152 unique words.
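
The counting behind a chart like the one above can be sketched in a few lines of Python.  This is only an illustration under assumptions: the function name is invented, the input is assumed to be a list of words already cleaned up and in posting order, and it covers only the running count of unique words at each 500-word mark, not the extrapolation equation.

```python
def vocabulary_growth(words, step=500):
    """Return (words_read, unique_words_so_far) pairs, one per `step` words.

    `words` is a list of already-tokenized words in the order they were
    written.  A final pair is added if the sample does not end exactly
    on a multiple of `step`.
    """
    seen = set()
    growth = []
    for i, word in enumerate(words, start=1):
        seen.add(word)
        if i % step == 0:
            growth.append((i, len(seen)))

    # Record the tail of the sample if it stops between two marks.
    if words and len(words) % step != 0:
        growth.append((len(words), len(seen)))
    return growth


if __name__ == "__main__":
    # Invented sample text; a real run would use the 20,000-word sample.
    sample = "the squid and the whale and the other squid".lower().split()
    for words_read, unique in vocabulary_growth(sample, step=3):
        print(words_read, unique)
```

Plotting the second number against the first gives the flattening curve that the extrapolation is fitted to.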