
Which words do you own?–Doors Left Open May 19, 2007

Posted by caveblogem in Blogs and Blogging, Haiku, linguistics, Other, statistical analysis, vocabulary.

[Note: This is part of a continuing series on the actual vocabulary in use in the blogosphere.  Posts on this subject started here and will continue on a somewhat weekly basis.]

This week’s volunteer is Canterbury Soul, a Singaporean, whose blog is called Doors Left Open.  His took a little longer than others I have done because it is such a new blog that I feared the sample would be too small.  I like to take at least 25,000, normally, and I wanted to make sure that the sample was big enough to yield some new words.  As it turned out, that wasn’t a significant problem.  I took a sample of slightly more than 20,000 (a census, really) and got rid of all of the proper names, like I normally do, and ended up with about 19,000 (including the pages of the blog that don’t run chronologically, actually).  But Canterbury Soul still added a hefty 682 new words to the database.

Here’s the resulting cloud composed of words that did not yet show up in any of the seventeen blogs sampled (click to enlarge).


And here’s the same cloud in a font called “Open Mind.”  I couldn’t find any fonts related to doors. . . .


And here’s the Venn diagram I usually make out of these words.  The left lobe consists of words that were new in the sample, that nobody else had used, sized relative to the frequency of use.  The middle lobe consists of words that everybody has used so far, sized according to how much more frequently Canterbury Soul used them in the sample.  And the right lobe consists of words which everyone else sampled thus far has used, but that Canterbury Soul did not, sized by frequency of use.
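For readers who think in code, the three lobes amount to simple set operations. The following is only a rough Python sketch, not how the diagram was actually built; the function name, the `samples` dictionary and the toy words are all made up for illustration, and it ignores the sizing of words entirely.

```python
def venn_lobes(samples, subject):
    """Split vocabulary into the three lobes described above.

    samples: hypothetical dict mapping blog name -> set of words it used.
    subject: the blog being profiled. At least one other blog is required.
    """
    own = samples[subject]
    others = [words for blog, words in samples.items() if blog != subject]
    used_by_any_other = set().union(*others)
    used_by_all_others = set.intersection(*others)

    left = own - used_by_any_other      # words only the subject used
    middle = own & used_by_all_others   # words everybody used
    right = used_by_all_others - own    # words everyone but the subject used
    return left, middle, right


# Toy data, invented for the example.
samples = {
    "subject": {"the", "oak", "woeful"},
    "blog_a": {"the", "history", "oak"},
    "blog_b": {"the", "history"},
}
left, middle, right = venn_lobes(samples, "subject")
print(left, middle, right)
```

Note that a word like "oak" here lands in no lobe at all, because one other blog used it but not all of them did.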


And finally, here is the Haiku generated by my Haiku-generating algorithm, which is improving rapidly, I think.  The words are those of Canterbury Soul, of course.  The arrangement is now almost purely mechanical.

Woeful on the oak,
germs of a paroxysm
recover midfield.

Next Up: Dayngrous Discourse, then Second Effort

Which words do you own?–Asara’s Mental Meanderings May 2, 2007

Posted by caveblogem in Blogs and Blogging, Haiku, linguistics, Other, statistical analysis, vocabulary.

[Note: This is part of a continuing series on the actual vocabulary in use in the blogosphere.  Posts on this subject started here and will continue on a somewhat weekly basis.]

A long time ago I worked for the State of California as an economist, and one of the more fun aspects of the job was preparing estimates, all kinds of estimates.  I had read an interview once with somebody who was part of the rather large team of people that prepared estimates of the Gross National Product [GNP] (yeah, it’s now called the Gross Domestic Product [GDP].  This person did this in the 1960s, if I remember correctly.)  The guy said that it was like making sausage, because economists use it all the time, but they don’t really want to know what goes into it.  I took to calling my place of employment the sausage factory, until it was clear that nobody got the joke or was interested in why I called it that.  It didn’t really catch on.

Anyway, Asara, today’s blogging vocabulary project volunteer, of Asara’s Mental Meanderings, actually works in a hot dog factory, so I took the liberty of rendering (yeah, that’s covert livestock humor) her vocabulary cloud in a font called Hot Dog.  So, for those of you who are new, these are words that out of the more than 20,000 distinct words in the database had not yet appeared.  Their size varies according to the frequency with which they appeared in Asara’s sampled posts (click to enlarge).


Asara added 392 new words to the database, out of a sample of 30,000.  Here is the cloud in Times New Roman font, for easier reading (click to enlarge).


There are lots of really cool words here, and cool juxtapositions of words.  Ranger, guild, paladin, are those Dungeons and Dragons words (or possibly whatever has replaced D&D–WoW)?

Anyway, the middle lobe of the Venn diagram is somewhat interesting. I was a little surprised that I used some of those words in my sample, particularly god (click to enlarge).


Here’s something new.  I was doing another project, one for which I was technically getting paid, that forced me to learn to use logical operators in Microsoft Excel.  And a strange combination of things I am reading (about which I might later post) got me thinking a little bit about Haiku.  So, to make a long story short, I have a large and growing database of words on hand, so I decided to take a crack at building a machine that uses them to generate a Haiku.  I will post some things about the various algorithms when it works a little better, when I have most of the bugs worked out.  For now, here is its first effort:

jogging near a trucker of
brain-dead gorillas

It’s not as enigmatic as I hoped it would be.  But not bad for a first effort, I think. 

The next volunteer is Are We There Yet?, slated for early next week.

Which words do you own?–the108 April 26, 2007

Posted by caveblogem in Blogs and Blogging, linguistics, Other, parenthood, vocabulary.

[Note: This is part of a continuing series on the actual vocabulary in use in the blogosphere.  Posts on this subject started here and will continue on a somewhat weekly basis.]

The108 is the first “Mommy Blog” to be added to the sample.  You probably could guess that from looking at the vocabulary cloud below, which contains words commonly used by the108 that are completely new to the database.


In all, the108 added 660 words to the database (from a sample of 30,000 or so, taken from posts April 1 – April 21, 2007).


The108 is going through a tough time in her life right now.  And I almost didn’t post this, thinking that I certainly don’t want to add to her troubles by analyzing her vocabulary.  I decided to post this anyway as a sort of a reminder that there are highs as well as lows.  They will come soon, I feel sure.  Go to her blog, tell her that your thoughts are with her. 

Here’s another cloud of words used by the108 just two times in her posts (the font is called “Damn Noisy Kids”):


I’m going to return to her blog this summer and I am going to be sending positive thoughts her way.  I’ll take another sample and I hope to see more joy and hope in it.  I am counting on it.

Which words do you own?–Mags April 20, 2007

Posted by caveblogem in Blogs and Blogging, linguistics, Other, statistical analysis, vocabulary, writing.

[Note: This is part of a continuing series on the actual vocabulary in use in the blogosphere.  Posts on this subject started here and will continue on a weekly basis.]

With a sample of 24,994 words taken (February 19 – April 10, 2007) from her blog, Ms. Maggie Moo Talks 2 U, added 652 words to the database.  What sort of words?  This cloud shows the new words that Mags used most frequently:


Makes me hungry, most of it.  The way the second-to-last word dominates the picture is what we call “irony.” Mags managed to use 3,579 different words in all.

The Venn diagram that I usually produce for this analysis is a little strange, this time.  Apparently Mags uses lots of words that you, dear reader, do not.  And she uses all of the words that you commonly use, too.  So the diagram is strangely lop-sided.  To restate that once again from the other direction, there were no words used by all of the other blogs sampled that Mags did not end up using.


So the answer to the question “which words do you own?” is, in Mags’ case, a lot of them.  If this happens again I’ll have to come up with a different sort of diagram. 

Which words do you own?–Daniel Miessler April 13, 2007

Posted by caveblogem in Blogs and Blogging, linguistics, Other, vocabulary, web 2.0, writing.

[Note: This is part of a continuing series on the actual vocabulary in use in the blogosphere.  Posts on this subject started here and will continue on a weekly basis.]

Below we have the vocabulary cloud for words added to the database by the blog of my latest volunteer, Daniel Miessler.


An interesting juxtaposition of “atheism” with “icons.”  It seems to call forth a new religion, some sort of virtual-objectivist desktop non-worship, the Church of the Apathetic Blogger, perhaps.  (I read his post on atheism and it does no such thing, of course.)  Then a little further down we have “freebase” and “grandpa,” which I don’t think were about Keith Richards’ remarks from last week.

Perhaps my favorite string begins with the word “server,” and ends with “waterfall.”  It is a “found” steampunk haiku.

Mr. Miessler’s Venn diagram looks like this:


There are a couple of words that pop out that I am thinking about dumping in future diagrams–“march” and “john.”  Microsoft Word recognizes these as words instead of proper names (because one of the macros I run uncapitalizes all words), but I’m pretty sure that bloggers are not using them as such.  It says something about the commonality of the name John, and the month of the year these samples drew heavily upon. 

Mr. Miessler added 528 new words to the database, pretty respectable, this late in the game.  His is the second technologically-leaning blog to be sampled.  I have high hopes that the next blog to be added, Ms Maggie Moo Talks 2 U, a blog that leans towards posts on cooking and restaurant events, will add a whole new sheaf of words. 

Which words do you own?–Shameless Words March 30, 2007

Posted by caveblogem in linguistics, Other, statistical analysis, vocabulary.

[Note: This is part of a continuing series on the actual vocabulary in use in the blogosphere.  Posts on this subject started here and will continue on a weekly basis.]

There are now 16,425 words in the database, of which 945 were added by the latest blog to be sampled: Shameless Words.  I think that’s pretty good at this point, particularly since I had to pull out big chunks of the Italian and French words with which he normally peppers his prose.  On the positive side, he added words like zephyr, whip, thug, tapestry, ivory, giddiness, audacious, and others.

I like the word cloud created from words that only Shameless Words used (of the ten blogs now sampled in the database).  The word lions comes from a recurring pictorial feature of the blog: Lions of Lyon.  Lots of intriguing things going on here, particularly the pig, pistol, robbery, safe part.  How can you look at that cloud and not want to go find out what-in-the-hell is going on there? (Do click to enlarge the image)


I find the intersection part of the Venn diagram pretty interesting this time, too.  It surprises me a little that, having sampled ten blogs and a total of more than 195,000 words, there are still so many words that everyone has used.   All ten blogs had 450 words in common. 


The largest words in the intersection were used by Shameless three times more often than the average blogger in the sample (They are sized in relation to one another, rather than on an absolute scale.  I’m almost certainly going to have to keep on doing this as words are added to the pool.)

Next up is Grasshopper Ramblings.

Which words do you own?–Moon Topples March 26, 2007

Posted by caveblogem in Blogs and Blogging, history, linguistics, Other, statistical analysis, tagging, vocabulary.

[Note: This is part of a continuing series on the actual vocabulary in use in the blogosphere.  Posts on this subject started here and continue even as I type this word.]

Despite Mr. Topples’ fears of not adding substantially to this database of words in use in the blogosphere, he brought 1,332 new words with his posts–not the record, but damn close.  Here’s a cloud made up of words that he and only he used of the nine blogs I have sampled thus far (click to enlarge image):


And here’s the other cloud diagram that shows, additionally, words that all other blogs so far have used but Mr. Topples did not, as well as words that all nine blogs used and Mr. Topples used more often (also click to enlarge).


The composition of both of these clouds is, I see, heavily influenced by the short-story contest that he ran, and some of the entries were included, since they were posted in the time frame that the sample covered.   The word vote is one such.  Words relating to vision, like eyes, and saw are also from that, I think.  I’m interested in the size of the word history, one of the words that everyone but Mr. Topples used.  Does this paint him as a thoroughly post-modern gen-xer?  I wonder.

I enjoyed putting together a story for that contest back when I was writing fiction.  I hope to resume doing so again soon.  So I’m now going to restrict these vocabulary of the blogosphere posts to once per week.  I know I said I’d do that before, but now I’m serious.   Really.

In related news, Anxious MoFo has developed a program in Perl now that samples words in a different but also interesting way.  Click here to bug him to reveal his coding secrets.

Finally, I’m still looking for a volunteer for the next sample.   If you’re interested, just let me know in the comment thread (so that others will know as soon as I get one). 

Which words do you own?–kuipercliff March 23, 2007

Posted by caveblogem in Blogs and Blogging, linguistics, Other, vocabulary, writing.

[Note: This is part of a continuing series on the actual vocabulary in use in the blogosphere.  Posts on this subject started here and continue even as I type this word.]

I wasn’t familiar with kuipercliff‘s blog until he asked to be the next subject in this series on the vocabulary of the blogosphere.  Since that time I have begun to read it.  There are some very interesting posts there.  Much of the blogosphere, I think, is more like kuipercliff’s blog than it is like the other blogs included thus far in my database.  What I mean is that his blog is centered, topically, on information technology to a much larger extent than others I have examined.  Even so, it covers a wide range of other subjects and uses an astonishing variety of words, 1,741 of which nobody had yet used.

The inclusion of Mr. Kuipercliff’s words also presented more quandaries than I would have liked to deal with.  I decided to allow the word commodification, even though it was not considered a word by either MS Word’s dictionary or the no-longer final authority on these matters–Webster’s New Twentieth Century Dictionary (unabridged, 1943).  Up until I started this project I usually employed this bulky tome (see pic below–click to enlarge) for pressing four-leaf clovers and certain paper-folding projects.  It will now go back to that function.  Commodification is just the sort of Marxian term that lexicologists were uncomfortable with in 1943.  Sure, it’s jargon.  I don’t care.  To Webster’s credit, it supported me, supported kuipercliff, on whinging–which did make the cut.


I didn’t allow disinhibition, which is, I suspect, also jargon, or at least a double-negative.  I have been studying neural networks recently and have found no need for such a word, or at least no instance in which it could not have been replaced by excitation or stimulation or some other reasonably well-worn word.  I’m willing to listen to reason, though, if somebody thinks that it should make the cut. 

For a couple of other words I consulted the new final authority on such matters, my wife, an English professor.  She didn’t see a problem with amygdala or neocortex (which were in kuipercliff’s sample), or even xiphoid, maxillary, hyoidal or other odd anatomical terms I threw at her to make my point about why these things should be excluded.  So, from now on, anatomical terms are fair game. 

Foreign languages are still out, however, unless they are the sort of foreign languages that are part of a long list of anatomical terms.   In other words, everyday Latin terms, like redux, are out.

Here’s a cloud made up of the words that, thus far, only kuipercliff has used.  I was only able to include the ones that he used four or more times because he used so many words that others had not (click to enlarge).


And here is a Venn diagram showing the words that only kuipercliff used, the words that he shared with everybody else, and the ones that others used but that he did not.


Next up: Moon Topples

Which words do you own?–Stiletto Girl March 22, 2007

Posted by caveblogem in Blogs and Blogging, linguistics, Other, statistical analysis, writing.

[Note: This is part of a continuing series on the actual vocabulary in use in the blogosphere.  Posts on this subject started here and seem to go on forever.]

Moontopples said last night,

Each blog you do means I’ll add less and less, but I’d love to be included. Maybe I’ll be your first subject to add no words at all.

I have also been concerned about that, Moontopples.  But things don’t seem to be winding down at all.  In fact, today’s subject, Stiletto Girl  added a remarkable 1,292 unique words from her sample of 21,000.  Her frequently used additions to the vocabulary database are represented in the vocabulary cloud below (Click to enlarge the picture.  Words are in font sizes double the number of times the words appeared). 


 I have to admit that I had to look at least one of them up.  I think she said that people should feel free to analyze her in the comment thread, but I may just have been hearing things. 

Here is the other diagram that I wanted to unveil today (below, click to enlarge).  It shows that other blogs used only 43 words that SG did not use (the right-hand lobe of the diagram below).  I think that’s just as amazing as the addition of nearly 1,300 words at this point in the game.  She’s the Motmistress of the Hour, clearly.


The middle of the diagram, the intersection of the sets, shows words that other blogs used as well, but only the ones that SG used a lot more often than others (due to space considerations–there were a lot of these).  It reminds me of a diagram a psychology professor once drew on the blackboard for my class.  On the left, the Id.  On the Right, the Superego.  In the middle, the Ego.  Or was that philosophy class?  In that case, on the left is evil.  On the right, good.  In the middle, the eternal conflict waged between the forces of darkness and light. 

I have noticed that many people tend to use noms de blog which are composed of regular words, rather than proper names.  Watching Speed Racer last night with my son I noticed that many of the villains and other characters are similarly named.  Speed Racer, Snake Oil, Cruncher Block, Inspector Detector, Rock Force, Racer X (yes, there are glaring exceptions, most notably Spritle and Trixie).  Is this all a part of what Douglas Coupland (I think it was him) called the “Hello Kittyfication of America”?  Or do I just need more sleep?

Next up: kuipercliff, followed by Mr. Topples.

On Constructing a Vocabulary Cloud March 20, 2007

Posted by caveblogem in Blogs and Blogging, DIY, how to, linguistics, Other, vocabulary.

[This is in response to Stiletto Girl‘s request for more information on how these word clouds were constructed, which was part of the comment thread to this post]

I was kinda hoping that nobody would ask me about how I actually make these things, because while it is ingenious, I am also well aware that it shows off my almost total ignorance of useful programming languages.  If I possessed such knowledge several of the steps below would be much, much quicker and easier.  As it is, it takes me an hour and one-half to crunch through one of these.  Pathetic.

First, I went to the blogs of these people and copied their posts into Microsoft Word.  Then I created some macros that did a few different things with the find/replace function of MS Word.  They stripped out all of the punctuation marks, extra spaces, made all of the letters lower-case, and got rid of common names (names for months, days, products, websites, etc.).  Having run these, I then looked through the documents for mis-spelled words (words that MS Word said were mis-spelled.)  These were usually of three types: actual mis-spellings of words, (commonly thier, instead of their, stuff like that) names, (but names that were not also regular words like stiletto and girl) and, finally, words that MS Word simply didn’t know. 
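For anyone who would rather script this step than build Word macros, here is a rough Python sketch of the same clean-up. It is not what I actually ran; the function name and the tiny list of names to drop are invented for the example, and the spell-check pass is left out because it was done by eye.

```python
import re

# Hypothetical stop-list; the real list of months, days, products and
# websites was much longer and is not reproduced here.
COMMON_NAMES = {"january", "monday", "google"}


def clean_text(raw, names=COMMON_NAMES):
    """Lower-case the text, strip punctuation and extra spaces, drop listed names."""
    # Normalize curly apostrophes so contractions survive as one word.
    lowered = raw.lower().replace("\u2019", "'")
    # Anything that is not a letter, apostrophe, hyphen or whitespace becomes a space.
    letters_only = re.sub(r"[^a-z'\-\s]", " ", lowered)
    # split() collapses runs of spaces; strip() trims stray quotes and hyphens.
    words = [w.strip("'-") for w in letters_only.split()]
    return [w for w in words if w and w not in names]


print(clean_text("On Monday, I found a 'zephyr' -- twice!"))
```

Keeping apostrophes and hyphens inside words is a choice made for this sketch; the macros described above stripped all punctuation.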

I corrected the mis-spellings and deleted the names.  For the third category you’ll have to take my word for my ability to recognize English words.  (The only reason I was able to get into graduate school at all with a 2.54 GPA in my undergraduate years was that I aced the Verbal section on the GREs, which to the department in question indicated a strong probability of success in their program.)  If I recognized a word that MS didn’t, I added it to the MS Word dictionary.

Then I used find/replace once more to make one long column of words by replacing spaces with paragraph marks.  Then the analysis had to leave MS Word in favor of MS Excel.  I pasted the column of words into Excel and created a column of numbers next to it, so that words could remain associated with the order in which they appeared in the sample.  I then sorted the words alphabetically, which usually unearthed a few punctuation marks or other characters that I had not eradicated with the macros in Word.  I deleted these and then sorted by the column of numbers (again, to retain words in the order in which they originally appeared.  This step is not necessary for the construction of the Vocabulary Cloud, but it is the only way to do the sort of curve-fitting that is shown in the first few posts on this subject). 
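The numbering trick in this step (tag each word with its position, sort to find the junk, then sort back) can be sketched in a few lines of Python. This is an illustration under my own assumptions, not the Excel procedure itself; the function names are invented, and the rule "a real word contains at least one letter" stands in for the deleting I did by hand.

```python
def number_words(words):
    """Pair each word with its 1-based position in the sample."""
    return [(position, word) for position, word in enumerate(words, start=1)]


def drop_stray_tokens(numbered):
    """Sort alphabetically to cluster the junk, drop it, then restore sample order."""
    by_word = sorted(numbered, key=lambda pair: pair[1])
    # Assumed rule: keep a token only if it contains at least one letter.
    kept = [(pos, word) for pos, word in by_word if any(ch.isalpha() for ch in word)]
    return sorted(kept, key=lambda pair: pair[0])


sample = ["the", "\u00bb", "cat", "\u2022", "sat"]
print(drop_stray_tokens(number_words(sample)))
```

In code the alphabetical sort is not strictly needed, since the filter could run in place, but it mirrors the step described above, where sorting is what makes the leftovers visible.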

Then the analysis had to leave Microsoft products entirely in favor of statistical processing software, SPSS, in this case.  I imported the words and numbers into SPSS and sorted them again by word.  I did this because SPSS and Excel sort differently, so I usually got rid of a few more things, bullets, those arrows that the French use to begin and close quotations.  Then I ran a frequency count, which is a very simple thing to do in SPSS.  Then I saved the frequency count, which consists of words and the number of times they appeared in the sample, as a separate file.  Then I merged this frequency file with a file that I had built out of other word frequency files.  That file looks like this (click picture to enlarge):
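The frequency count and the merge can also be sketched in Python using the standard library's `Counter`. Again, this is a stand-in for the SPSS steps rather than a record of them; the table layout (a dictionary of words, each holding per-blog counts) and the blog labels are assumptions made for the example.

```python
from collections import Counter


def merge_counts(table, blog, words):
    """Count the words in one sample and add them to the master table.

    table: hypothetical dict mapping word -> {blog label: count}.
    """
    for word, count in Counter(words).items():
        table.setdefault(word, {})[blog] = count
    return table


master = {}
merge_counts(master, "blog_a", ["cat", "dog", "cat"])
merge_counts(master, "blog_b", ["dog", "zephyr"])
print(master)
```

A word that a blog never used simply has no entry for that blog, which plays the role of the empty cells in the merged file.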


As you can see, the words are listed in the column on the left, with the number of times they were used in the 20,000-word samples from various blogs (Neil Gaiman’s Journal, Daily Kos, Raincoaster, Three Quarks Daily, Pretty Good on Paper, and Alabaster Crippens doesn’t know what’s going on) in subsequent columns.

Then I had to create a variable that showed whether the word I imported from the new frequency counts was new or had appeared before (this appears in the far right column).  This is pretty easy to do in SPSS, because you can compute new variables from existing ones using simple logic statements.  The logic for EelKat’s, OnlyEK, looked like this:

OnlyEK = FreqEK if (Freq3Qd  < 1 & FreqsKos < 1 & FreqPGP < 1 & FreqsNG < 1 & FreqsRC < 1 & FreqsAC < 1 & FreqEK >= 1)

Then I sorted OnlyEK, so that the words she used most frequently (and nobody else had yet used) would appear at the top, with their frequency counts. 
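The same "only this blog" logic, followed by the sort, might look like this in Python. It is a sketch, not a translation of the SPSS file: the table layout and the short blog labels are invented, and a missing count is treated as zero.

```python
def only_new(table, new_blog):
    """List (word, count) for words used by new_blog and by no other blog.

    table: hypothetical dict mapping word -> {blog label: count}.
    Results are sorted most frequent first, ties broken alphabetically.
    """
    rows = []
    for word, counts in table.items():
        own = counts.get(new_blog, 0)
        others = sum(n for blog, n in counts.items() if blog != new_blog)
        # Same test as the logic statement: every other frequency < 1, own >= 1.
        if own >= 1 and others < 1:
            rows.append((word, own))
    return sorted(rows, key=lambda row: (-row[1], row[0]))


table = {
    "cat": {"NG": 2},
    "dog": {"NG": 1, "EK": 1},
    "zephyr": {"EK": 3},
    "thug": {"EK": 1},
}
print(only_new(table, "EK"))
```

Summing the other blogs' counts avoids having to add another condition to the test each time a blog joins the database.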

Then I copied and pasted this list into MSWord for the final step, changing the font sizes to correspond to frequencies.  Then I just got rid of the paragraphs and voila!  Vocabulary Cloud.
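If you wanted the cloud as a web page rather than a Word document, the last step could be sketched as below. This is purely illustrative: I did the sizing in Word, the HTML output is my own substitution, and the default of two points per use is an assumption borrowed from the rule mentioned in the Stiletto Girl post.

```python
from html import escape


def cloud_html(rows, points_per_use=2):
    """Render (word, count) pairs as one run of text, sized by frequency.

    points_per_use is an assumed scale: font size in points = count * scale.
    """
    spans = [
        f'<span style="font-size:{count * points_per_use}pt">{escape(word)}</span>'
        for word, count in rows
    ]
    # Joining with spaces instead of line breaks mirrors removing the paragraphs.
    return " ".join(spans)


print(cloud_html([("zephyr", 3), ("thug", 1)]))
```

With a linear scale like this, a word used forty times would be set at 80 points, so a real cloud would probably want a cap or a gentler curve.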