What do you want to do with word frequencies?

John Wonderlich wrote:

After defining (and normalizing) the likelihood that words appear in text, you could start making comparisons between bodies of work, and creating interesting tag-cloudish visualizations of what distinguishes some text you’d like to analyze. You could build a widget for your blog that says “the following are the words that are more than 25% more likely to be used on this blog than they are to be used in New York Times cover stories”, or, “here are recent news stories that also have similarly unlikely words used.”

I don’t know how people usually do cloud visualizations, but if I were
making a word cloud, that’s *precisely* what I would do — i.e. this is
probably how people do it.

See:
http://en.wikipedia.org/wiki/TFIDF
http://en.wikipedia.org/wiki/Latent_Semantic_Indexing
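
As a rough sketch of the comparison John describes, the snippet below normalizes word counts into relative frequencies and keeps the words that are more than 25% more likely in one text than in another. The names `word_probs` and `distinctive_words`, the tokenizing regex, and the `floor` value used for words missing from the reference text are all my own assumptions, not anything from his post or from a particular library.

```python
import re
from collections import Counter


def word_probs(text):
    """Return each word's share of the total word count in `text`."""
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def distinctive_words(target, reference, threshold=1.25, floor=1e-6):
    """Map words to their frequency ratio (target / reference) when it exceeds `threshold`.

    `floor` is a made-up stand-in frequency for words that never appear in
    `reference`, so that the ratio stays finite.
    """
    p_target = word_probs(target)
    p_reference = word_probs(reference)
    result = {}
    for word, p in p_target.items():
        ratio = p / p_reference.get(word, floor)
        if ratio > threshold:
            result[word] = ratio
    return result
```

Calling `distinctive_words(blog_text, nyt_text)` would give the candidate words for the widget, with the ratio available for sizing them in a cloud. Note that rare words absent from the reference get enormous ratios, which is one reason the choice of `floor` (or some other smoothing) matters.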

Now, the thing is that word counts actually don’t get you very much information. Remember back to the days before Google: search engines gave you back documents by matching words and returning the documents where your search terms appeared most frequently. Then Google came along and ranked documents differently, and we all saw how *awful* word frequency was for determining relevance to a query.

So the question is what you would use word counts *for*. Clouds are nice, but look for cases where words aren’t exactly the appropriate level of chunking to identify relevance. (And, you will see this in most word clouds.) Articles back in 2004 about the Democratic ticket might have used the word “John” an exceptional amount owing to the dynamic duo’s shared first name, but “John” in a word cloud isn’t very informative. You’d want to chunk whole names together, but that’s a difficult problem in itself.
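
To make the chunking problem concrete, here is a deliberately crude sketch that treats any run of two or more capitalized words as a single chunk before counting. The pattern and the name `count_name_chunks` are my own assumptions; this is nowhere near real name recognition (it will happily glue a sentence-initial word onto a following name), which is rather the point about the problem being hard.

```python
import re
from collections import Counter

# Assumption: a "name" is two or more capitalized words separated by single spaces.
NAME_PATTERN = re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z]+)+")


def count_name_chunks(text):
    """Count runs of capitalized words as whole chunks instead of single words."""
    return Counter(NAME_PATTERN.findall(text))
```

On a sentence like "John Kerry and John Edwards spoke", this counts "John Kerry" and "John Edwards" as separate chunks rather than counting "John" twice.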

Note also, for comparing documents, that a word’s frequency isn’t very indicative of its prominence in a text, and if you have a profile (i.e. a vector) of word frequencies for each of two documents, it’s not immediately obvious how you would compare the profiles to arrive at whatever result you want. (That’s not to say there aren’t ways to do it, only that there are many ways to do it.)
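
One of those many ways, shown purely as an illustration, is cosine similarity between the two count vectors. The function name and the whitespace tokenization are assumptions on my part, and other measures would be just as defensible depending on the result you are after.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the raw word-count vectors of two texts."""
    # Assumption: lowercase and split on whitespace; no stemming or stop words.
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Only words present in both texts contribute to the dot product.
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)
```

The score runs from 0 (no words in common) to 1 (identical proportions), and it ignores document length, which may or may not be what you want.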
