
On Constructing a Vocabulary Cloud March 20, 2007

Posted by caveblogem in Blogs and Blogging, DIY, how to, linguistics, Other, vocabulary.

[This is in response to Stiletto Girl’s request for more information on how these word clouds were constructed, which was part of the comment thread to this post]

I was kinda hoping that nobody would ask me how I actually make these things because, while the method is ingenious, I am well aware that it shows off my almost total ignorance of useful programming languages.  If I possessed such knowledge, several of the steps below would be much, much quicker and easier.  As it is, it takes me an hour and a half to crunch through one of these.  Pathetic.

First, I went to the blogs of these people and copied their posts into Microsoft Word.  Then I created some macros that did a few different things with the find/replace function of MS Word.  They stripped out all of the punctuation marks and extra spaces, made all of the letters lower-case, and got rid of common names (names for months, days, products, websites, etc.).  Having run these, I then looked through the documents for misspelled words (that is, words that MS Word said were misspelled).  These were usually of three types: actual misspellings (commonly thier instead of their, stuff like that); names (but only names that were not also regular words, like stiletto and girl); and, finally, words that MS Word simply didn’t know.
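For anyone who does know a useful programming language, the cleanup that the Word macros perform could be sketched roughly as follows in Python.  This is only an illustration of the idea, not what I actually ran: the function name normalize and the tiny COMMON_NAMES list are made up for the example, and a real list of months, days, products, and websites would be far longer.

```python
import re

# Hypothetical stand-in for the list of common names (months, days,
# products, websites) that the Word macros removed.
COMMON_NAMES = {"january", "monday", "google"}

def normalize(text, names=COMMON_NAMES):
    """Lower-case the text, strip punctuation and extra spaces, drop listed names."""
    text = text.lower().replace("\u2019", "'")   # treat curly apostrophes as straight ones
    # Anything that is not a letter, an apostrophe, or whitespace becomes a space.
    text = re.sub(r"[^a-z'\s]", " ", text)
    # split() with no argument also collapses runs of whitespace.
    words = [w.strip("'") for w in text.split()]
    return [w for w in words if w and w not in names]

print(normalize("On Monday, I  went to Google!"))
```

The spell-check pass (fixing misspellings, deleting names, accepting words Word does not know) is a judgment call and is not attempted here.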

I corrected the misspellings and deleted the names.  For the third category you’ll have to take my word for my ability to recognize English words.  (The only reason I was able to get into graduate school at all with a 2.54 GPA in my undergraduate years was that I aced the Verbal section on the GREs, which to the department in question indicated a strong probability of success in their program.)  If I recognized a word that MS didn’t, I added it to the MS Word dictionary.

Then I used find/replace once more to make one long column of words by replacing spaces with paragraph marks.  Then the analysis had to leave MS Word in favor of MS Excel.  I pasted the column of words into Excel and created a column of numbers next to it, so that words could remain associated with the order in which they appeared in the sample.  I then sorted the words alphabetically, which usually unearthed a few punctuation marks or other characters that I had not eradicated with the macros in Word.  I deleted these and then sorted by the column of numbers, again to retain words in the order in which they originally appeared.  (This step is not necessary for the construction of the Vocabulary Cloud, but it is the only way to do the sort of curve-fitting that is shown in the first few posts on this subject.)
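The Excel step above (number the words, sort alphabetically to expose strays, delete them, then restore the original order) might look something like this in Python.  The function names are invented for the sketch, and "stray" is assumed to mean any entry that is not made up of letters and apostrophes.

```python
def number_words(words):
    """Pair each word with its 1-based position in the sample."""
    return list(enumerate(words, start=1))

def find_strays(rows):
    """Sort alphabetically and return the entries that are not plain words."""
    by_word = sorted(rows, key=lambda row: row[1])
    return [(pos, w) for pos, w in by_word if not w.replace("'", "").isalpha()]

def drop_strays(rows):
    """Remove the strays, then sort by position to restore the original order."""
    strays = set(find_strays(rows))
    return sorted((row for row in rows if row not in strays), key=lambda row: row[0])

rows = number_words(["the", "\u2022", "cave", "\u00bb", "the"])
print(find_strays(rows))   # the bullet and the closing guillemet
print(drop_strays(rows))
```

Keeping the position numbers is what allows the words to be put back in the order in which they appeared, which only matters for the curve-fitting and not for the cloud itself.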

Then the analysis had to leave Microsoft products entirely in favor of statistical processing software, SPSS in this case.  I imported the words and numbers into SPSS and sorted them again by word.  I did this because SPSS and Excel sort differently, so I usually got rid of a few more things: bullets, and those arrows that the French use to open and close quotations.  Then I ran a frequency count, which is a very simple thing to do in SPSS.  Then I saved the frequency count, which consists of words and the number of times they appeared in the sample, as a separate file.  Then I merged this frequency file with a file that I had built out of other word frequency files.  That file looks like this (click picture to enlarge):

cloudmethodpic.jpg

As you can see, the words are listed in the column on the left, with the number of times they were used in the 20,000-word samples from various blogs (Neil Gaiman’s Journal, Daily Kos, Raincoaster, Three Quarks Daily, Pretty Good on Paper, and Alabaster Crippens doesn’t know what’s going on) in subsequent columns.
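The frequency count and the merge into one big table, which I do in SPSS, could be approximated in Python with collections.Counter.  This is a sketch under assumptions: the function name frequency_table and the short column labels ("NG", "EK") are invented, and the two tiny word lists stand in for the real 20,000-word samples.

```python
from collections import Counter

def frequency_table(samples):
    """Build {word: {label: count}} from a dict of label -> list of words.

    Each word gets one row and each blog gets one column; a blog that
    never used the word gets a count of 0.
    """
    counts = {label: Counter(words) for label, words in samples.items()}
    vocabulary = sorted(set().union(*counts.values()))
    return {w: {label: counts[label][w] for label in samples} for w in vocabulary}

table = frequency_table({
    "NG": ["the", "cat"],          # made-up sample
    "EK": ["the", "eel", "eel"],   # made-up sample
})
for word, row in table.items():
    print(word, row)
```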

Then I had to create a variable that showed whether the word I imported from the new frequency counts was new or had appeared before (this appears in the far right column).  This is pretty easy to do in SPSS, because you can compute new variables from existing ones using simple logic statements.  The logic for EelKat’s variable, OnlyEK, looked like this:

OnlyEK = FreqEK if (Freq3Qd  < 1 & FreqsKos < 1 & FreqPGP < 1 & FreqsNG < 1 & FreqsRC < 1 & FreqsAC < 1 & FreqEK >= 1)

Then I sorted OnlyEK, so that the words she used most frequently (and nobody else had yet used) would appear at the top, with their frequency counts. 
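The same "used by her, and by nobody before her" test, followed by the sort, can be written in a few lines of Python.  Again this is just a sketch: only_new is a made-up name, the table layout ({word: {column: count}}) is an assumption, and the counts below are invented for illustration.

```python
def only_new(table, new_label):
    """Return (word, count) pairs for words that appear in new_label's column
    and in no other column, most frequent first (ties broken alphabetically)."""
    result = []
    for word, row in table.items():
        others = [n for label, n in row.items() if label != new_label]
        if row.get(new_label, 0) >= 1 and all(n < 1 for n in others):
            result.append((word, row[new_label]))
    return sorted(result, key=lambda pair: (-pair[1], pair[0]))

# Invented counts; the column names echo two of the SPSS variables above.
table = {
    "the": {"FreqsNG": 3, "FreqEK": 2},
    "eel": {"FreqsNG": 0, "FreqEK": 4},
    "cat": {"FreqsNG": 1, "FreqEK": 0},
    "bog": {"FreqsNG": 0, "FreqEK": 1},
}
print(only_new(table, "FreqEK"))
```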

Then I copied and pasted this list into MS Word for the final step, changing the font sizes to correspond to frequencies.  Then I just got rid of the paragraph marks and voila!  Vocabulary Cloud.
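I set the font sizes by hand in Word, but the idea can be sketched in Python too.  Everything specific here is an assumption made for the example: the linear scaling, the 10 to 48 point range, and the choice to emit HTML spans rather than a Word document.

```python
def font_sizes(freqs, min_pt=10, max_pt=48):
    """Map each word's frequency linearly onto a font size in points."""
    if not freqs:
        return {}
    lo, hi = min(freqs.values()), max(freqs.values())
    span = (hi - lo) or 1   # avoid dividing by zero when all counts are equal
    return {w: round(min_pt + (n - lo) * (max_pt - min_pt) / span)
            for w, n in freqs.items()}

def cloud_html(freqs):
    """Join the words into one run of text, each sized by its frequency."""
    sizes = font_sizes(freqs)
    return " ".join(f'<span style="font-size:{sizes[w]}pt">{w}</span>' for w in freqs)

print(cloud_html({"eel": 5, "bog": 3, "kat": 1}))
```

Joining the spans with single spaces plays the part of getting rid of the paragraph marks.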


Comments»

1. Stiletto Girl - March 20, 2007

Don’t undermine yourself, I think it’s brilliant! Do me, do me!

And thank you for the explanation. Leave it to me to be difficult [by asking the tough questions] heehee

Have you posted a list of words that MS Word does not recognize? Would be interesting to see. Can’t you add new words in Word? It’s been a long time since I’ve used it. I’m assuming you can do it under SpellCheck which, by the way, is something I never use. I should run the test on myself to see how many words I have NOT gotten away with flubbing.

2. caveblogem - March 20, 2007

SG,

That part must have been confusing in the above. I do the spell-check in MS Word and I do add new words there. I haven’t posted a list of words that MS Word doesn’t recognize. I wasn’t even able to get an estimate of how many different words are in the program’s basic dictionary, although I only tried for a few minutes. You’d think that would be part of their sales pitch or something. . . .

Anyway, thanks for the compliment. And consider yourself done, Stiletto Girl, [ASAP].

3. Stiletto Girl - March 20, 2007

I wonder how many times the f word will show up. Now I am nervous. I am going to take a pill!

4. Stiletto Girl - March 20, 2007

It would also be interesting if you could import the Urban Dictionary into MS Word and spice up run-of-the-mill expressions with street lingo.

5. caveblogem - March 20, 2007

SG,

I think that the people who have gone before you have used every variation of the F-word already. So it’s unlikely to show up. If you want, I’ll let you take a look at the cloud first and decide whether or not you want it posted.

I don’t know how to import files into Word’s dictionary in bulk. And I’m not sure I’d want to at this point. Maybe later, but I like having more direct control over what goes in.

6. Stiletto Girl - March 20, 2007

Actually, no, go ahead with the public display! Oh, hopefully not too torturous on the eyes!

7. Language of the Blogosphere « KuiperCliff - March 21, 2007

[…] the vocabulary of the blogosphere. An unabashed non-programmer, he’s been doing things the long way round, with fascinating and entertaining results. In a twist on the now-ubiquitous tag cloud, he is […]

8. kuipercliff - March 21, 2007

Brilliant stuff. As you can see, I’ve written a wee blurb about this most excellent endeavour. If you’re still looking for volunteers, I’m totally up for it, and I promise to leave my favourite word at the door: sphenisciform.

9. caveblogem - March 21, 2007

kuipercliff,

You are in. Thanks for the link and the blurb. I’m thinking I could have yours done by Friday or Saturday.

Sphenisciform. Cool word. Does it have something to do with that weird-looking bone we have in back of our noses? Or are both similarly rooted in some other oddly shaped thing?

10. kuipercliff - March 21, 2007

I usually translate it as “of, or pertaining to, penguins”. I’d love to know what that nasal bone is btw. A recent straw poll of my archaeologist friends showed conclusively that the calcaneus (heel bone) was their favourite bone. They also decided 4-3 that archaeology was ‘boring’. Hence the survey.

Good news about the experiment. It’s about time my over-zealous use of inappropriate adverbs, ‘anyway’ and ‘although’, is outed once and for all. I look forward to the results.

11. Moon Topples - March 22, 2007

CaveyB: Oh, please consider adding my five words to the running total. Each blog you do means I’ll add less and less, but I’d love to be included. Maybe I’ll be your first subject to add no words at all.

12. Which words do you own?–A Mom, a Blog, and the Life In-Between « Pretty Good on Paper - June 15, 2007

[…] (to some) analysis of the most common words here.  And there is some discussion of method here and […]

