
Cybernetic Haiku September 9, 2008

Posted by caveblogem in 3QD, Constructivism, Haiku, Other, Three Quarks Daily, vocabulary.

If I have any readers left, they might remember that I used to periodically examine other blogs, sacking them for words and studying words that seemed relatively unique to them [See the “Studies on the Working Vocabulary of the Blog-O-Sphere” section of this page.]  Towards the end of that phase, I used a simple algorithm to create a Haiku out of the words that a blogger used more often than other bloggers.  Yeah, it’s kinda weird and a little too complicated to explain succinctly, but you could read some of the posts and see the project develop.  And it made sense at the time. . . 

Anyway, it bothered me at the time that I was unable to automate the process of generating a Haiku out of a bunch of random words.  It bothered me that I had to intervene in the process.  I wanted to be able to push a button and have the computer do the rest, but I didn’t yet have the skill-set that I needed.  But I do now.  So here it is.  Have some fun; click the icon below.

This project demonstrates one of my favorite things about human thought–the compulsive and unconscious ways we create meaning.  We see a string of words and our brains just automatically start making sense out of them.  Doesn’t matter that they are random.  Recently I read a blog post (I think it was in Three Quarks Daily, but I can’t seem to find it now) in which somebody explained a party game based on the principle (and don’t get me started on the exploitation of this quirk in hypnotism).  A person volunteers to leave the room and, upon returning, guess the pertinent details of a dream that one of the others will relate to the rest of the participants while she is out of the room.  However, no dream is actually told to the others during her absence.  The other participants just randomly answer the questions of the volunteer, trying to keep their answers consistent with the ones that precede them.  Thus the dream is entirely a figment of the volunteer’s imagination, and usually ends up telling the participants a little more than they want to know about the mind of the volunteer.

Yeah, it sounds more like a dirty trick than a game.  But it is an interesting metaphor for life, too.  And I am desperately trying to tell myself that that is a good thing, these days.  If you are an optimist, you are much more likely to find happiness, because you expect to–you look for it, assuming it is there somewhere.  

Anyway, this looks to be my last extracurricular programming project until at least November, and probably even later than that, since I want to participate in NaNoWriMo again this year.  I started a new job last week and between that and the two classes I’m taking, I won’t have much time to put into this sort of thing for a while.  

When I saw that Moon Topples is blogging again I briefly toyed with the idea of setting this thing up so that it automatically posted a haiku for me each day on this site–a poor-but-efficient imitation of MT’s Monday Morning Haiku posts.  But I think I’ll just ask that if any readers of this blog manage to get the machine to produce a particularly interesting poem, they post it in a comment below.

Which words do you own?–2Dolphins June 6, 2007

Posted by caveblogem in 3QD, Blogs and Blogging, Haiku, linguistics, Other, Three Quarks Daily, vocabulary.

[Note: This is part of a continuing series on the actual vocabulary in use in the blogosphere.  Posts on this subject started here and will continue on a somewhat weekly basis. There is an interesting (to some) analysis of the most common words here.  And there is some discussion of method here.]

Today’s volunteer is the blog 2Dolphins, which is run by a married couple in Texas.  I had to go back quite a ways, chronologically, to get a large enough sample (there are a lot of pictures and such on the blog).  So this sample runs from September of 2005 to May 31, 2007.  This may account, partially, for the fact that it added an inordinate number of words to the database.  I guess technically it was an ordinate number, since all numbers are, by definition, ordinate (I think), but you know what I mean.  They added 1,024 words (that’s right, two to the tenth power), which is a lot.

There were 5,593 different words in the sample, which is also a new record. 

Here is a word cloud composed of the words used more than twice by 2Dolphins but not at all by any of the other 22 blogs sampled thus far.


I was happy to see the word dolphin’s in this, but not as happy as I would have been to see dolphins or dolphin.  The very first blog I sampled (other than my own), Three Quarks Daily, used the word “dolphin.”  And Alabaster Crippens used the word “dolphins” in the sample I took from his blog.  Anyway, here’s another copy of the same cloud, in a font called “dolphin.”  Best I can do.


And here’s the Venn diagram I usually make out of these words.  The left lobe consists of words that were new in the sample, that nobody else had used, sized relative to the frequency of use.  The middle lobe consists of words that everybody has used so far, sized according to how much more frequently 2Dolphins used them in the sample.  And the right lobe consists of only one word which everyone else sampled thus far has used, but that 2Dolphins did not.
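For anyone who wants to try the three lobes at home, here is a rough Python sketch of how they could be computed. It is not the code behind the diagram. The function name `venn_lobes`, the toy numbers, and the choice to size the middle lobe by the ratio of this blog's usage rate to the average rate of the other blogs are all assumptions made for illustration.

```python
def venn_lobes(blog_rate, other_rates):
    """Split a vocabulary into the three lobes described above.

    blog_rate:   {word: uses per 1,000 words} for the blog being studied.
    other_rates: non-empty list of such dicts, one per previously sampled blog.
    """
    used_by_any_other = set().union(*other_rates)
    used_by_all_others = set.intersection(*(set(r) for r in other_rates))

    # Left lobe: words nobody else has used, sized by this blog's own rate.
    left = {w: r for w, r in blog_rate.items() if w not in used_by_any_other}

    # Middle lobe: words everybody uses, sized by how much more often this
    # blog uses them than the others do on average (an assumed measure).
    middle = {}
    for w in used_by_all_others & set(blog_rate):
        average = sum(r[w] for r in other_rates) / len(other_rates)
        middle[w] = blog_rate[w] / average

    # Right lobe: words every other blog used but this one did not.
    right = used_by_all_others - set(blog_rate)
    return left, middle, right


# Toy rates, invented for illustration only.
blog = {"pod": 4.0, "dudes": 2.0, "the": 90.0}
others = [
    {"the": 50.0, "which": 10.0},
    {"the": 70.0, "which": 20.0, "squid": 5.0},
]
left, middle, right = venn_lobes(blog, others)
print(left, middle, right)
```

With real data the right lobe shrinks quickly as more blogs are sampled, which fits the single word left in it here.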


And finally, here is another effort by my Haiku-generating algorithm, which stumbled a record five times.  There weren’t enough verbs to choose from, so it kept crashing.  “Snoopy” is supposed to function as an adjective in the poem, not as a beloved cartoon dog.  It is only capitalized because it is at the start.  The dudes are snoopy.

Snoopy on the pod,
dudes from an aggregator
rename a protein.
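
To show why a shortage of verbs would make a generator like this fall over, here is a minimal sketch of a template-filling approach in Python. It is a guess at the general idea, not the algorithm used for the poem above. The hand-labeled `LEXICON`, the `TEMPLATE`, and the function `make_haiku` are all hypothetical, and the sketch assumes every fixed word in the template is one syllable.

```python
import random

# Hypothetical hand-labeled lexicon: word -> (part of speech, syllable count).
LEXICON = {
    "snoopy": ("adj", 2),
    "pod": ("noun", 1),
    "dudes": ("noun", 1),
    "aggregator": ("noun", 4),
    "protein": ("noun", 2),
    "rename": ("verb", 2),
}

# Each slot is either a fixed one-syllable word or a (part of speech,
# syllables) pair to fill from the lexicon. The lines add up to 5, 7, 5.
TEMPLATE = [
    [("adj", 2), "on", "the", ("noun", 1)],
    [("noun", 1), "from", "an", ("noun", 4)],
    [("verb", 2), "a", ("noun", 2)],
]


def make_haiku(lexicon, rng=random):
    """Fill the template with unused words; fail loudly if a slot has none."""
    unused = dict(lexicon)
    lines = []
    for slots in TEMPLATE:
        words = []
        for slot in slots:
            if isinstance(slot, str):
                words.append(slot)
                continue
            candidates = sorted(w for w, tag in unused.items() if tag == slot)
            if not candidates:
                raise LookupError(f"no unused word fits slot {slot}")
            choice = rng.choice(candidates)
            del unused[choice]  # never use the same word twice
            words.append(choice)
        lines.append(" ".join(words))
    # Only the first letter of the poem is capitalized.
    return "\n".join(lines).capitalize()


print(make_haiku(LEXICON))
```

Remove "rename" from the lexicon and `make_haiku` raises `LookupError` at the third line, which is the sketch's version of stumbling for lack of verbs.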

As always, the vocabulary clouds and Haiku are the property of the volunteers, except that said volunteer may not have them taken off of my site but may otherwise do with them what they wish.  Thanks for participating, 2Dolphins!

Which words do you own?–Raincoaster March 17, 2007

Posted by caveblogem in 3QD, Blogs and Blogging, daily kos, dailykos, linguistics, Neil Gaiman, Other, Three Quarks Daily, vocabulary, writing.

I first saw raincoaster’s blog in WordPress’s list of fast growing blogs or popular blogs or something like that.  I check in once in a while and am always entertained.  We share a passion for H. P. Lovecraft and squids and a couple of other things.

I took a sample of her blog posts yesterday and processed them, and I was a little bummed out at first about the fact that the largest words in the cloud are tag words, and so they present little new information.  Many of the others have to do with current events, so they don’t seem like the timeless blogger fingerprints I had envisioned a day or two ago. 

Nevertheless, if you look at some of the smaller words that pop out, distinguishing her vocabulary from those of Three Quarks Daily, Daily Kos, Pretty Good on Paper, and Neil Gaiman’s Journal, you’ll see some interesting nuggets (below, click to enlarge).


On a side note about Neil Gaiman, Dan somebody (whose relationship to Mr. Gaiman I did not quite get) has done some interesting analysis of Mr. Gaiman’s blog over a long span of time.  The links, if you should wish to pursue them, are in the comment thread to the post on Mr. Gaiman here.  The analysis and method Dan used seem more sophisticated than mine.  For example, he passes Mr. Gaiman’s words through something called a “Yahoo Term Extraction API.”  If I remember my Latin roots correctly it seems to have something to do with bees.  At any rate, his analysis also takes a slightly different tack, examining “terms,” rather than words, and chopping off words that have fewer than five letters.  So what you will see are dynamic clouds of what I suspect are topical concerns of Mr. Gaiman, rather than the individual word usage.  They are fascinating to look at, however, and quite clever.

The day before yesterday I stood for several minutes looking at a book on Java at my local Barnes and Noble.  If it hadn’t had water damage, which made that funny noise and wouldn’t let me flip pages easily, I would have actually bought it.  If anyone is interested in saving me the time and trouble it would take to learn some sort of dynamic html, I would happily partner up, supplying data and analysis for the creation of some sort of acceptable widget. 

Up next, Alabaster Crippens.

A Vocabulary Cloud March 14, 2007

Posted by caveblogem in 3QD, Blogs and Blogging, daily kos, dailykos, linguistics, literature, Other, politics, tagging, Three Quarks Daily, vocabulary, web 2.0.

I collected a bigger sample of words from my own blog this morning.  I did not include the recent posts on vocabulary stuff, because I might have influenced the number of unique words, as MoonTopples and strugglingwriter both pointed out.  Instead, the sample is from posts January 4 – March 9, 2007, totalling 20,000 words.  (See the first post in this series if you don’t know what I am talking about.)

Then I put them in a database with the samples from Three Quarks Daily and Daily Kos and just pulled out words that were unique to my site (words which did not appear in 3QD or Daily Kos samples at all).  Then I sorted these by the number of times they appeared on my site, assigning each a font size, in points, of twice its number of occurrences on my blog (so that words appearing three times are in 6 point Verdana–I didn’t include words unique to my blog that appeared fewer than three times, because there were more than a thousand).  I then sorted them again alphabetically and the result is the vocabulary cloud below (click for a larger image).
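
The recipe above is short enough to sketch in a few lines of Python. This is an illustration under assumptions, not the database queries actually used: the function name `vocabulary_cloud` and the toy counts are made up, and the sizes are just numbers of points with no rendering attached.

```python
from collections import Counter


def vocabulary_cloud(my_counts, other_words, min_count=3):
    """Return (word, font size in points) pairs, sorted alphabetically.

    Keeps words that appear at least min_count times in my sample and
    never in the other samples; font size is twice the occurrence count.
    """
    unique = {
        word: n
        for word, n in my_counts.items()
        if word not in other_words and n >= min_count
    }
    return [(word, 2 * n) for word, n in sorted(unique.items())]


# Toy counts, invented for illustration only.
mine = Counter({"squid": 5, "haiku": 3, "the": 40, "verdana": 2})
others = {"the", "politics"}

print(vocabulary_cloud(mine, others))
```

In the toy data, "the" drops out because the other samples use it, and "verdana" drops out because it appears fewer than three times.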


It is, in some ways, the opposite of the tag clouds you see in technorati, because so many of those tag clouds are made up of proper names, which I excluded from the samples I took.

It’s like a blogger’s fingerprint.

How many words do you actually use?–Three Quarks Daily March 12, 2007

Posted by caveblogem in 3QD, Blogs and Blogging, linguistics, Other, statistical analysis, Three Quarks Daily, vocabulary, web 2.0, writing.

I said in my last post that I was going to look at the vocabularies of some other blogs to make me feel better.  And I did, but I left the analysis I did yesterday of the foremost political blog in the US at home, so I can’t post it today (and the answer is “yes,” it did make me feel better). 

I can post another blog’s results, however.  Three Quarks Daily is the first blog I ever read.  That’s not quite right.  3QD is the first thing I ever read after somebody showed it to me and said “that is a blog.”  Anyway, 3QD is still one of my favorites.  It is like getting the best of Science, The New Yorker, The Economist and several other top-notch magazines delivered daily with interesting commentary. 

It is written by at least 4 or 5 people per week and by as many as maybe 12.  So, if these people have different vocabularies, the total number of words they might use in a sample could be expected to exceed mine by a good margin.  And it did. 

In a 13,000 word sample covering March 4 – March 12, 3QD used 3,654 different words (my total, for a comparison, was 2,734 words, so they used about a third more than me).  I actually drew a much larger sample from their site than I did mine, partially because I wanted to be comprehensive, fair, and careful.  In 18,000 words 3QD used 4,576 different words.
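
Counts like these come from tracking how many different words have turned up as the sample grows. A minimal Python sketch of that bookkeeping follows. The tokenizing rule (lowercase runs of letters and straight apostrophes) and the names `tokenize` and `vocabulary_growth` are assumptions, and real blog text would need more cleanup than this.

```python
import re


def tokenize(text):
    """Lowercase the text and pull out runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def vocabulary_growth(words, step):
    """Return (words read so far, different words so far) every `step` words."""
    seen = set()
    points = []
    for i, word in enumerate(words, start=1):
        seen.add(word)
        if i % step == 0:
            points.append((i, len(seen)))
    return points


sample = "The dudes rename a protein. The dudes are snoopy."
words = tokenize(sample)
print(len(words), len(set(words)))
print(vocabulary_growth(words, step=3))
```

Each pair in the output is one data point of the kind a regression on sample size versus vocabulary size would use.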

And a little regression produced the following quadratic equation [The “Pearson’s R” was actually a little better when using a cubic model, but that would imply an infinite vocabulary, and as much as I respect the team over there at 3QD I feel like I have to draw the line somewhere]:


Which, as you can see in the chart below (click to enlarge–actual data is in red, estimates are in blue) tops out around a total vocabulary in use of 5,698. 
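
To see how a quadratic "tops out," here is a small Python sketch that passes a curve of the form y = a·x² + b·x through the origin and the two figures quoted above (3,654 different words at 13,000 and 4,576 at 18,000), then finds its peak. Using only those three points, and forcing the curve through zero, are assumptions made for illustration. The regression in this post presumably used more data, so the sketch does not reproduce its equation.

```python
def quadratic_through_origin(p1, p2):
    """Return (a, b) so that y = a*x**2 + b*x passes through (0, 0), p1, p2."""
    (x1, y1), (x2, y2) = p1, p2
    # Dividing y = a*x**2 + b*x by x gives y/x = a*x + b, a straight line
    # whose slope is a and whose intercept is b.
    a = (y2 / x2 - y1 / x1) / (x2 - x1)
    b = y1 / x1 - a * x1
    return a, b


def peak(a, b):
    """Vertex of y = a*x**2 + b*x; it is a maximum only when a < 0."""
    x_peak = -b / (2 * a)
    return x_peak, a * x_peak**2 + b * x_peak


a, b = quadratic_through_origin((13_000, 3_654), (18_000, 4_576))
x_peak, y_peak = peak(a, b)
print(f"a = {a:.3e}, b = {b:.4f}")
print(f"peak near {x_peak:,.0f} words read, {y_peak:,.0f} different words")
```

Because `a` comes out negative, the curve bends downward and its vertex lands a little above 5,700 different words, the same neighborhood as the 5,698 quoted above. A cubic with a positive leading coefficient has no such ceiling, which is the "infinite vocabulary" problem mentioned earlier.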


It occurred to me this morning that it might be interesting to see how many words are shared by different blogs, as a sort of Venn diagram, perhaps.  Do bloggers speak different languages?  How much vocabulary is shared by different blogs?  Which words do they share?  Which are different?  I must know!  So I’ll be taking a look at that question after I do a couple more of these.