
Which words do you own?–Shameless Words March 30, 2007

Posted by caveblogem in linguistics, Other, statistical analysis, vocabulary.

[Note: This is part of a continuing series on the actual vocabulary in use in the blogosphere.  Posts on this subject started here and will continue on a weekly basis.]

There are now 16,425 words in the database, of which 945 were added by the latest blog to be sampled: Shameless Words.  I think that’s pretty good at this point, particularly since I had to pull out big chunks of the Italian and French words with which he normally peppers his prose.  On the positive side, he added words like zephyr, whip, thug, tapestry, ivory, giddiness, audacious, and others.

I like the word cloud created from words that only Shameless Words used (of the ten blogs now sampled in the database).  The word lions comes from a recurring pictorial feature of the blog: Lions of Lyon.  Lots of intriguing things going on here, particularly the pig, pistol, robbery, safe part.  How can you look at that cloud and not want to go find out what-in-the-hell is going on there? (Do click to enlarge the image)


I find the intersection part of the Venn diagram pretty interesting this time, too.  It surprises me a little that, having sampled ten blogs and a total of more than 195,000 words, there are still so many words that everyone has used.   All ten blogs had 450 words in common. 


The largest words in the intersection were used by Shameless three times more often than the average blogger in the sample (They are sized in relation to one another, rather than on an absolute scale.  I’m almost certainly going to have to keep on doing this as words are added to the pool.)

Next up is Grasshopper Ramblings.

Another Comment on Comments March 30, 2007

Posted by caveblogem in Blogs and Blogging, Other, Rock.

I was driving my son to school and continuing our conversation on Rock music this morning.  We were listening to Only a Lad, by Oingo Boingo, one of his favorite songs [he’s nine years old and idolizes Bart Simpson, The Mythbusters, his uncle (who is into large hybrid rockets, industrial-sized fireworks, blowing things up, and sent my kid a do-it-yourself trebuchet kit for Christmas this year) and Calvin (of the Calvin and Hobbes comic strip)]. For those of you who don’t know it, the song starts like this:

Johnny was bad, even as a child everybody could tell
Everyone said if you don’t get straight
You’ll surely go to Hell

But Johnny didn’t care
He was an outlaw by the time that he was
Ten years old
He didn’t wanna do what he was told
Just a prankster, juvenile gangster

His teachers didn’t understand
They kicked him out of school
At a tender early age
Just because he didn’t want to learn things
(Had other interests)
He liked to burn things

I dropped him off and then listened to some more on the way to work and was in the process of composing a post about libertarian Ska bands of the 1980s (a shorter chapter in the school of rock posts) and I realized that I never told the story of how I got that particular Oingo Boingo album. 

I really wanted that album.  And my Mom found that out and sent it to me for Christmas, which was a little weird, because I didn’t have the slightest idea how she could have found out that I wanted it.  I had never mentioned it to her.  I hadn’t told anyone . . . except . . .  

Except I had mentioned in a comment on somebody else’s blog that I really wanted that album but didn’t have the cash to purchase it myself.  Mom googled me and found that gift idea in the comment thread.  Apparently I’m usually hard to buy for. 

Anyway, I thought I should let people know about this.  And I’ve been thinking about comments after my post yesterday.  And my birthday is coming up. . . .

A Plea to my Blogger Friends (the ones with blogspot URLs) March 29, 2007

Posted by caveblogem in blogger, bloglines, Blogs and Blogging, DIY, how to, My other blog, Other, web 2.0.

Something has troubled me for some time now.  I have some blogging friends on WordPress, and some on Blogger.  I can keep up with all of my comments on all of my WordPress friends through the comments functionality of WordPress.  Unfortunately, when I comment on blogspot blogs I have to remember that I commented, and then return there repeatedly to see whether anyone has responded to my comment.  I have added more friends with blogspot addresses recently and I am finding it impossible to keep up with them.

In a perfect world we would all be on WordPress, of course.   But I can understand the reluctance of these people to migrate.  Blogger can do things that WordPress still cannot do. 

About a month ago Silver Tiger had a post about this, and he noted that you can get some comment feeds on blogger.  I tried to get some of these and found that blogger had three different varieties of comments feeds.   What I want you to do, my blogger friends, is to turn on the full comments feed on your blog.  That way, I’ll be able to keep up with you better, and so will your other good friends. 

Here’s what you need to do.

Go to the help section of the blogger website here.  It will tell you how to go to your settings tab, change your site feed to advanced mode, and then enable all three types of comment feed.  After you do this, I, or anybody else, can go to a favorite feed reader and pick up the feed by using the following feed URL:


Do it for me, blogspot friends.  Let us have better conversations.  To get my comment feed from this WordPress site, of course, you just click on the little rss comments icon over on the right hand side of my blog over there—>

Or use this URL:  https://caveblogem.wordpress.com/comments/feed/

I tried this out today on my blogspot blog and it works, people.  For some reason Bloglines doesn’t update the comments feeds as often as they do a full feed, but I’m O.K. with that for now.  Just so long as I don’t have to remember all this stuff all of the time.


A Matter of Taste Tests March 28, 2007

Posted by caveblogem in coffee, Other, statistical analysis.

Back a few weeks ago, just before the Academy Awards, I was listening to this interview on the radio about how sound effects are done for the movies.  This guy in the business explained to the interviewer that you couldn’t always use the sounds that people might actually hear (the actual, authentic sounds).  Sometimes you had to produce the sounds that people expected to hear.  I think that one of his examples was from Blood Diamond or perhaps Babel.  He was saying that part of the film was shot outside of the United States and some of the sounds of birds would not even have been recognized as birds by U.S. moviegoers, so he replaced them with more common sounds in the film–chickens, maybe.  I don’t know.

I bring this up because I have serious doubts about taste tests, particularly where coffee is concerned.  I saw the results of one such a few months ago and, if I remember correctly (and my memory is what experts call capricious, unpredictable, and merely adequate) Dunkin’ Donuts coffee was rated at the top, followed by Green Mountain, with Eight O’Clock and then maybe Seattle’s Best all coming in ahead of Starbucks.

Now there is no accounting for taste, I know.  But the sound guy interview made me wonder a little.  Isn’t it possible that people doing taste tests for coffee are judging not which tastes best, but which tastes most like coffee?  And if their experience of coffee is growing up with Folgers or Maxwell House or institutional coffee from cafeteria urns, then the ones that remind them of that taste, but slightly less bitter, will rise to the top in a test?

I wonder about this because I didn’t really take my coffee drinking seriously at all until I moved to the Pacific Northwest.  There I went to a school where Starbucks was pretty much the only coffee served.  And it was relatively cheap.  For a small fee the vendor supplied people with a thermos mug (see below, click picture to enlarge).  Refills for the mug pictured below (20 oz, I think, which was the third largest size available–and this was long before people started getting iced coffee drinks) were 74 cents.


To me, that stuff was coffee.  You didn’t need to put cream or sugar or anything in it.  It was sweet and not bitter, and had a lot of flavor. 

A couple of years ago my Mom was out visiting and we decided to take the train into Boston.  There was a Dunk’s at the train station here in Lowell and we stopped to get some coffee and then rushed onto the train. The Dunk’s had a foul smell, but I didn’t think much about it at the time.  Mom took one sniff of her coffee and threw it in the trash before we even got on the train.  I kept mine, and tried several times to drink some, because I really needed the caffeine.  I just couldn’t do it.  When we got to the North Station I dumped it into the trash.  It smelled foul.  But the odd thing is, it smelled just like the kind of coffee I used to get at a cafeteria in San Francisco, which was made by an immigrant couple who drank only tea.

Digital Signatures March 27, 2007

Posted by caveblogem in DIY, how to, lifehack, Memory, Other.

I took something to the post office a couple of weeks ago and when I paid for whatever it was with my debit card the postal employee asked me for my “digital signature.”  I was a little confused, but I picked up the weird stylus thing and prepared to sign something pressure-sensitive, assuming that she misunderstood and thought I was charging it as credit.  “No,” she said, “your digital signature.  Your pin number.”

“Oh,” I said.  “I’ve never heard them called that before.”

“That’s what the Postmaster wants us to call them,” she apologized.

Driving back to work I was thinking about the differences between handwritten signatures and these four-digit codes, which brought to mind all of the passwords we are plagued with these days, and by “we” I mean us people who are in the blogosphere. 

Pins are like signatures because they, in combination with the numbers on the physical card, are unique and identifiable.

Unlike signatures, however, they are so plain.  And in this age of personalized ringtones and the ability to choose to dress in combinations of clothes from any era, they seem so mass-produced.  When I was a senior in high school I changed my signature so that it didn’t really look all that much like my name anymore.  But it was interesting-looking and very distinctive. 

I lost my debit card last month and I called the bank and they issued me a new one, with a new pin.  They told me that I could change the pin whenever I want.  But I suspect that doing so would make it easier for other people to guess.  So I decided to keep it.  But I personalized it as an attempt to memorize it quickly, nonetheless.  Here’s how:

I associate all numbers with consonants according to the following, standard (yes, people all over the English-speaking world use this same one, and I don’t know where it came from) scheme:

  • 1=T or D
  • 2=N
  • 3=M
  • 4=R
  • 5=L
  • 6=Sh, J, Zh, or Ch
  • 7=K
  • 8=F or V
  • 9=B or P
  • 0=S or soft C

So the new pin immediately formed a couple of distinctive, easy-to-remember words in my mind, because you can vary the consonants a little (Nos 1, 6, 8, 9, and 0) and you can vary the vowels considerably, leaving them out or putting them in to make words or names.  I chose the most distinctive and repeated it a couple of times and can never forget it.
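For anyone who wants to play with this, here is a rough Python sketch of the digit-to-consonant table above (the scheme is commonly known as the mnemonic major system). The function names and the example number 315 are made up for illustration; it is certainly not my pin. Note that the sketch works from spelling rather than sound, so it only approximates the scheme: letters that are not in the table (hard C, G, and so on) are simply skipped, and doubled letters count twice.

```python
# Digit-to-consonant table exactly as listed in the post.
DIGIT_SOUNDS = {
    "1": ["t", "d"],
    "2": ["n"],
    "3": ["m"],
    "4": ["r"],
    "5": ["l"],
    "6": ["sh", "j", "zh", "ch"],
    "7": ["k"],
    "8": ["f", "v"],
    "9": ["b", "p"],
    "0": ["s", "soft c"],
}

# Reverse lookup from spelling to digit; "soft c" is handled separately below.
SPELLING_TO_DIGIT = {
    s: d for d, sounds in DIGIT_SOUNDS.items() for s in sounds if " " not in s
}


def consonant_options(number):
    """List the consonant choices available for each digit of the number."""
    return [DIGIT_SOUNDS[d] for d in number]


def word_to_digits(word):
    """Approximate the digits a word stands for, judging by spelling only."""
    word = word.lower()
    digits = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        ch = word[i]
        if len(pair) == 2 and pair in SPELLING_TO_DIGIT:
            # Two-letter sounds: sh, ch, zh.
            digits.append(SPELLING_TO_DIGIT[pair])
            i += 2
        elif ch == "c" and word[i + 1:i + 2] in ("e", "i", "y"):
            # Soft C (before e, i, or y) counts as 0.
            digits.append("0")
            i += 1
        elif ch in SPELLING_TO_DIGIT:
            digits.append(SPELLING_TO_DIGIT[ch])
            i += 1
        else:
            # Vowels and any letter missing from the table carry no digit.
            i += 1
    return "".join(digits)


def word_matches(word, number):
    """True if the word's consonants spell out the given digits."""
    return word_to_digits(word) == number
```

So `word_matches("motel", "315")` holds, because M, T, and L stand for 3, 1, and 5 and the vowels are free. Going the other way, `consonant_options("315")` shows which consonants you have to work with when hunting for a word.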

So it is a little more like a signature to me.  But I realized that nobody else could ever marvel at my cool pin number, ’cause that would defeat the purpose of the thing. 

Oh, well.  Just as I was leaving the post office the postal worker told me that the postmaster wants them all to call pencils “graphite dispensers.”  I laughed, but I don’t think she was kidding.  That’s like calling a rifle a “hot lead dispenser.”  Doesn’t say much about the function of the object, does it?

Which words do you own?–Moon Topples March 26, 2007

Posted by caveblogem in Blogs and Blogging, history, linguistics, Other, statistical analysis, tagging, vocabulary.

[Note: This is part of a continuing series on the actual vocabulary in use in the blogosphere.  Posts on this subject started here and continue even as I type this word.]

Despite Mr. Topples’ fears of not adding substantially to this database of words in use in the blogosphere, he brought 1,332 new words with his posts–not the record, but damn close.  Here’s a cloud made up of words that he and only he used of the nine blogs I have sampled thus far (click to enlarge image):


And here’s the other cloud diagram that shows, additionally, words that all other blogs so far have used but Mr. Topples did not, as well as words that all nine blogs used and Mr. Topples used more often (also click to enlarge).


The composition of both of these clouds is, I see, heavily influenced by the short-story contest that he ran, and some of the entries were included, since they were posted in the time frame that the sample covered.   The word vote is one such.  Words relating to vision, like eyes, and saw are also from that, I think.  I’m interested in the size of the word history, one of the words that everyone but Mr. Topples used.  Does this paint him as a thoroughly post-modern gen-xer?  I wonder.

I enjoyed putting together a story for that contest back when I was writing fiction.  I hope to resume doing so again soon.  So I’m now going to restrict these vocabulary of the blogosphere posts to once per week.  I know I said I’d do that before, but now I’m serious.   Really.

In related news, Anxious MoFo has developed a program in Perl now that samples words in a different but also interesting way.  Click here to bug him to reveal his coding secrets.

Finally, I’m still looking for a volunteer for the next sample.   If you’re interested, just let me know in the comment thread (so that others will know as soon as I get one). 

Which words do you own?–kuipercliff March 23, 2007

Posted by caveblogem in Blogs and Blogging, linguistics, Other, vocabulary, writing.

[Note: This is part of a continuing series on the actual vocabulary in use in the blogosphere.  Posts on this subject started here and continue even as I type this word.]

I wasn’t familiar with kuipercliff‘s blog until he asked to be the next subject in this series on the vocabulary of the blogosphere.  Since that time I have begun to read it.  There are some very interesting posts there.  Much of the blogosphere, I think, is more like kuipercliff’s blog than it is like the other blogs included thus far in my database.  What I mean is that his blog is centered, topically, on information technology to a much larger extent than others I have examined.  Even so, it covers a wide range of other subjects and uses an astonishing variety of words, 1,741 of which nobody had yet used.

The inclusion of Mr. Kuipercliff’s words also presented more quandaries than I would have liked to deal with.  I decided to allow the word commodification, even though it was not considered a word by either MS Word’s dictionary or the no-longer final authority on these matters–Webster’s New Twentieth Century Dictionary (unabridged, 1943).  Up until I started this project I usually employed this bulky tome (see pic below–click to enlarge) for pressing four-leaf clovers and certain paper-folding projects.  It will now go back to that function.  Commodification is just the sort of Marxian term that lexicologists were uncomfortable with in 1943.  Sure, it’s jargon.  I don’t care.  To Webster’s credit, it supported me, supported kuipercliff, on whinging–which did make the cut. 


I didn’t allow disinhibition, which is, I suspect, also jargon, or at least a double-negative.  I have been studying neural networks recently and have found no need for such a word, or at least no instance in which it could not have been replaced by excitation or stimulation or some other reasonably well-worn word.  I’m willing to listen to reason, though, if somebody thinks that it should make the cut. 

For a couple of other words I consulted the new final authority on such matters, my wife, an English professor.  She didn’t see a problem with amygdala or neocortex (which were in kuipercliff’s sample), or even xiphoid, maxillary, hyoidal or other odd anatomical terms I threw at her to make my point about why these things should be excluded.  So, from now on, anatomical terms are fair game. 

Foreign languages are still out, however, unless they are the sort of foreign languages that are part of a long list of anatomical terms.   In other words, everyday Latin terms, like redux, are out.

Here’s a cloud made up of the words that, thus far, only kuipercliff has used.  I was only able to include the ones that he used four or more times because he used so many words that others had not (click to enlarge).


And here is a Venn diagram showing the words that only kuipercliff used, the words that he shared with everybody else, and the ones that others used but that he did not.
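The three regions of that diagram boil down to set operations. Here is a minimal Python sketch, assuming each blog's sample has already been reduced to a set of words; the function name, the region labels, and the sample words in the usage note are hypothetical. I take "others" to mean words that all of the other blogs used, which is how the Moon Topples post in this series describes its diagram.

```python
def venn_regions(target_words, other_blogs):
    """Split a vocabulary into the three regions of the Venn diagram.

    target_words: the words used by the blog being profiled.
    other_blogs:  one collection of words per previously sampled blog.
    """
    target = set(target_words)
    others = [set(words) for words in other_blogs]

    # Words that at least one other blog used.
    used_by_any_other = set().union(*others)
    # Words that every other blog used (empty if there are no other blogs).
    used_by_all_others = set.intersection(*others) if others else set()

    return {
        # Left lobe: words only the target blog used.
        "only_target": target - used_by_any_other,
        # Middle: words the target shares with every other blog.
        "shared_by_all": target & used_by_all_others,
        # Right lobe: words every other blog used but the target did not.
        "others_not_target": used_by_all_others - target,
    }
```

For example, calling it with `{"the", "lions", "pig"}` against two other blogs that both used "the" and "history" puts "lions" and "pig" in the left lobe, "the" in the middle, and "history" on the right.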


Next up: Moon Topples

Which words do you own?–Stiletto Girl March 22, 2007

Posted by caveblogem in Blogs and Blogging, linguistics, Other, statistical analysis, writing.

[Note: This is part of a continuing series on the actual vocabulary in use in the blogosphere.  Posts on this subject started here and seem to go on forever.]

Moontopples said last night,

Each blog you do means I’ll add less and less, but I’d love to be included. Maybe I’ll be your first subject to add no words at all.

I have also been concerned about that, Moontopples.  But things don’t seem to be winding down at all.  In fact, today’s subject, Stiletto Girl  added a remarkable 1,292 unique words from her sample of 21,000.  Her frequently used additions to the vocabulary database are represented in the vocabulary cloud below (Click to enlarge the picture.  Words are in font sizes double the number of times the words appeared). 


 I have to admit that I had to look at least one of them up.  I think she said that people should feel free to analyze her in the comment thread, but I may just have been hearing things. 

Here is the other diagram that I wanted to unveil today (below, click to enlarge).  It shows that other blogs used only 43 words that SG did not use (the right-hand lobe of the diagram below).  I think that’s just as amazing as the addition of nearly 1,300 words at this point in the game.  She’s the Motmistress of the Hour, clearly.


The middle of the diagram, the intersection of the sets, shows words that other blogs used as well, but only the ones that SG used a lot more often than others (due to space considerations–there were a lot of these).  It reminds me of a diagram a psychology professor once drew on the blackboard for my class.  On the left, the Id.  On the Right, the Superego.  In the middle, the Ego.  Or was that philosophy class?  In that case, on the left is evil.  On the right, good.  In the middle, the eternal conflict waged between the forces of darkness and light. 

I have noticed that many people tend to use noms de blog which are composed of regular words, rather than proper names.  Watching Speed Racer last night with my son I noticed that many of the villains and other characters are similarly named.  Speed Racer, Snake Oil, Cruncher Block, Inspector Detector, Rock Force, Racer X (yes, there are glaring exceptions, most notably Spritle and Trixie).  Is this all a part of what Douglas Coupland (I think it was him) called the “Hello Kittyfication of America”?  Or do I just need more sleep?

Next up: kuipercliff, followed by Mr. Topples.

On Constructing a Vocabulary Cloud March 20, 2007

Posted by caveblogem in Blogs and Blogging, DIY, how to, linguistics, Other, vocabulary.

[This is in response to Stiletto Girl‘s request for more information on how these word clouds were constructed, which was part of the comment thread to this post]

I was kinda hoping that nobody would ask me about how I actually make these things, because while it is ingenious, I am also well aware that it shows off my almost total ignorance of useful programming languages.  If I possessed such knowledge several of the steps below would be much, much quicker and easier.  As it is, it takes me an hour and one-half to crunch through one of these.  Pathetic.

First, I went to the blogs of these people and copied their posts into Microsoft Word.  Then I created some macros that did a few different things with the find/replace function of MS Word.  They stripped out all of the punctuation marks, extra spaces, made all of the letters lower-case, and got rid of common names (names for months, days, products, websites, etc.).  Having run these, I then looked through the documents for mis-spelled words (words that MS Word said were mis-spelled.)  These were usually of three types: actual mis-spellings of words, (commonly thier, instead of their, stuff like that) names, (but names that were not also regular words like stiletto and girl) and, finally, words that MS Word simply didn’t know. 

I corrected the mis-spellings and deleted the names.  For the third category you’ll have to take my word for my ability to recognize English words.  (The only reason I was able to get into graduate school at all with a 2.54 GPA in my undergraduate years was that I aced the Verbal section on the GREs, which to the department in question indicated a strong probability of success in their program.)  If I recognized a word that MS didn’t, I added it to the MS Word dictionary.

Then I used find/replace once more to make one long column of words by replacing spaces with paragraph marks.  Then the analysis had to leave MS Word in favor of MS Excel.  I pasted the column of words into Excel and created a column of numbers next to it, so that words could remain associated with the order in which they appeared in the sample.  I then sorted the words alphabetically, which usually unearthed a few punctuation marks or other characters that I had not eradicated with the macros in Word.  I deleted these and then sorted by the column of numbers (again, to retain words in the order in which they originally appeared.  This step is not necessary for the construction of the Vocabulary Cloud, but it is the only way to do the sort of curve-fitting that is shown in the first few posts on this subject). 
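For readers who do know a useful programming language, the Word and Excel steps so far (lower-casing, stripping punctuation, dropping names, and numbering the words in order) could be sketched in a few lines of Python. This is only an illustration, not what I actually ran: the function name and the short list of names to drop are made-up stand-ins, and keeping internal apostrophes and hyphens (so that "don't" and "z-list" survive as single words) is an assumption. Spell-checking is not covered.

```python
import re

# Hypothetical stand-in for the list of common names (months, days,
# products, websites) that the Word macros removed.
NAMES_TO_DROP = {"january", "march", "monday", "google", "wordpress"}


def clean_words(text, names_to_drop=NAMES_TO_DROP):
    """Return (position, word) pairs for the cleaned-up sample."""
    text = text.lower()
    # Normalise curly apostrophes so contractions stay in one piece.
    text = text.replace("\u2019", "'")
    # Keep runs of letters, allowing internal apostrophes or hyphens;
    # all other punctuation, digits, and extra spaces fall away.
    words = re.findall(r"[a-z]+(?:['-][a-z]+)*", text)
    kept = [w for w in words if w not in names_to_drop]
    # Number the words from 1 so the original order can be restored
    # after any later sort.
    return list(enumerate(kept, start=1))
```

Sorting the result by word and then again by position mirrors the two Excel sorts described above.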

Then the analysis had to leave Microsoft products entirely in favor of statistical processing software, SPSS, in this case.  I imported the words and numbers into SPSS and sorted them again by word.  I did this because SPSS and Excel sort differently, so I usually got rid of a few more things, bullets, those arrows that the French use to begin and close quotations.  Then I ran a frequency count, which is a very simple thing to do in SPSS.  Then I saved the frequency count, which consists of words and the number of times they appeared in the sample, as a separate file.  Then I merged this frequency file with a file that I had built out of other word frequency files.  That file looks like this (click picture to enlarge):


As you can see, the words are listed in the column on the left, with the number of times they were used in the 20,000-word samples from various blogs (Neil Gaiman’s Journal, Daily Kos, Raincoaster, Three Quarks Daily, Pretty Good on Paper, and Alabaster Crippens doesn’t know what’s going on) in subsequent columns.
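The frequency count and the merge into one big table can be approximated in Python rather than SPSS. The sketch below assumes each blog's sample is already a list of cleaned words; the function names are hypothetical, and the column names in the usage note are borrowed from the variable names in this post purely as labels.

```python
from collections import Counter


def frequency_count(words):
    """Count how many times each word appears in one blog's sample."""
    return Counter(words)


def merge_counts(tables):
    """Merge per-blog counts into one table.

    tables: dict mapping a column name to that blog's Counter.
    Returns {word: {column: count}}, with 0 where a blog never used
    the word, and the words in alphabetical order.
    """
    all_words = set()
    for counts in tables.values():
        all_words.update(counts)
    return {
        word: {col: counts.get(word, 0) for col, counts in tables.items()}
        for word in sorted(all_words)
    }
```

Given counts for `"FreqEK"` and `"FreqsNG"`, each row of the merged table lists one word with its count in every column, much like the file in the picture.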

Then I had to create a variable that showed whether the word I imported from the new frequency counts was new or had appeared before (this appears in the far right column).  This is pretty easy to do in SPSS, because you can compute new variables from existing ones using simple logic statements.  The logic for EelKat’s, OnlyEK, looked like this:

OnlyEK = FreqEK if (Freq3Qd  < 1 & FreqsKos < 1 & FreqPGP < 1 & FreqsNG < 1 & FreqsRC < 1 & FreqsAC < 1 & FreqEK >= 1)

Then I sorted OnlyEK, so that the words she used most frequently (and nobody else had yet used) would appear at the top, with their frequency counts. 
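The same logic, plus the sort, can be written as a short Python function. This is a sketch under assumptions: the merged table is taken to be a dict of per-word counts keyed by column name, the function name is made up, and the tiny table inside the block holds invented numbers purely for illustration (apart from the seventy uses of "z-list", which comes from the EelKat post).

```python
def only_used_by(merged, target, others):
    """Words the target blog used at least once and no other blog used.

    merged: {word: {column: count}} for every word in the database.
    target: the column name of the blog being profiled.
    others: the column names of the previously sampled blogs.
    Returns (word, count) pairs, most frequent first, ties alphabetical.
    """
    rows = []
    for word, counts in merged.items():
        if counts[target] >= 1 and all(counts[col] < 1 for col in others):
            rows.append((word, counts[target]))
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows


# Hypothetical miniature table with made-up counts.
EXAMPLE_TABLE = {
    "z-list": {"FreqEK": 70, "FreqsNG": 0, "FreqsKos": 0},
    "cat": {"FreqEK": 3, "FreqsNG": 1, "FreqsKos": 0},
    "eel": {"FreqEK": 4, "FreqsNG": 0, "FreqsKos": 0},
    "vote": {"FreqEK": 0, "FreqsNG": 2, "FreqsKos": 5},
}
```

Calling `only_used_by(EXAMPLE_TABLE, "FreqEK", ["FreqsNG", "FreqsKos"])` keeps "z-list" and "eel" and drops "cat" (someone else used it) and "vote" (EelKat did not).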

Then I copied and pasted this list into MSWord for the final step, changing the font sizes to correspond to frequencies.  Then I just got rid of the paragraphs and voila!  Vocabulary Cloud.
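The font-size step is simple arithmetic. Other posts in this series give the rule as a font size double the number of times a word appeared, so a minimal Python sketch might look like the following. The function names, the optional size cap, and the HTML output are my assumptions for illustration; the actual clouds were formatted by hand in MS Word.

```python
def cloud_sizes(rows, points_per_use=2, max_points=None):
    """Turn (word, count) pairs into (word, font size in points) pairs.

    points_per_use=2 follows the "double the number of times" rule.
    max_points is an optional, hypothetical cap for runaway words.
    """
    sized = []
    for word, count in rows:
        size = count * points_per_use
        if max_points is not None:
            size = min(size, max_points)
        sized.append((word, size))
    return sized


def cloud_html(sized):
    """Render the sized words as one run of HTML spans."""
    return " ".join(
        f'<span style="font-size:{size}pt">{word}</span>'
        for word, size in sized
    )
```

A word used four times comes out at 8 points, and a cap such as `max_points=72` keeps a very frequent word from swamping the rest of the cloud.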

Which words do you own?–EelKat March 19, 2007

Posted by caveblogem in Blogs and Blogging, linguistics, Other, tagging, vocabulary, writing.

[Note: This is part of a continuing series on the actual vocabulary in use in the blogosphere.  Posts on this subject started here and seem to go on forever.]

Eelkat added 900 new words to the pile, from a sample of 20,000 words taken from her posts spanning February 26 to March 17, 2007.

The words in the cloud below (click picture to enlarge) appeared three or more times in the sample and not at all in the six other blogs that comprise the database.  They appear in a font size double the number of times they appeared, except for “z-list,” which appeared seventy times and would have been too large at 130 point type. 


There are now 12,799 different words in the database, collected from sampling 110,000 words from seven blogs.

If anyone is still reading this and wants to be next, just link to this site and let me know.  If I don’t hear from anyone today I’ll probably do another A-list blog.  And they already have enough links; they don’t need another from me. 

I think that the next analysis will include three wordclouds put into a Venn Diagram that looks like this:


So there is a special bonus to the next guinea pig volunteer.