John Wonderlich wrote:
After defining (and normalizing) the likelihood that words appear in text, you could start making comparisons between bodies of work, and creating interesting tag-cloudish visualizations of what distinguishes some text you’d like to analyze. You could build a widget for your blog that says “the following are the words that are more than 25% more likely to be used on this blog than they are to be used in New York Times cover stories”, or, “here are recent news stories that also have similarly unlikely words used.”
I don’t know how people usually do cloud visualizations, but if I were
making a word cloud, that’s *precisely* what I would do — i.e. this is
probably how people do it.
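The comparison John describes can be sketched in a few lines. This is just one way to do it, assuming a crude regex tokenizer and made-up names (`word_frequencies`, `distinctive_words`); the 25% threshold maps to a frequency ratio of 1.25:

```python
from collections import Counter
import re

def word_frequencies(text):
    """Normalized word frequencies: each count divided by the total word count."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def distinctive_words(blog_text, reference_text, ratio=1.25):
    """Words at least `ratio` times more likely in blog_text than in reference_text."""
    blog = word_frequencies(blog_text)
    ref = word_frequencies(reference_text)
    return sorted(w for w, f in blog.items()
                  if w in ref and f / ref[w] >= ratio)
```

Note the sketch only compares words that appear in *both* texts; words unique to the blog have an undefined (infinite) ratio, and deciding how to handle them is its own design choice.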
Now, the thing is that word counts on their own actually don’t get you very much information. Think back to the days before Google: search engines matched words and returned the documents where your search terms appeared most frequently. Then Google came along, ranked documents differently, and we all saw how *awful* raw word frequency was for determining relevance to a query.
So the question is what you would use word counts *for*. Clouds are nice, but look for cases where single words aren’t the appropriate level of chunking for identifying relevance (and you will see this in most word clouds). Articles back in 2004 about the Democratic ticket might have used the word “John” an exceptional amount owing to the dynamic duo’s shared first name, but “John” in a word cloud isn’t very informative. You’d want to chunk whole names together, but that’s a difficult problem in itself.
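One cheap first step toward chunking names, short of real named-entity recognition, is to count adjacent word pairs instead of single words, so that a recurring name surfaces as a frequent bigram. A minimal sketch (the function name `bigram_counts` is my own):

```python
from collections import Counter
import re

def bigram_counts(text):
    """Count adjacent word pairs; a multiword name like 'john kerry'
    shows up as a frequent bigram rather than two separate words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(zip(words, words[1:]))
```

Raw bigram counts are noisy ("of the" will dominate real prose), which is why collocation-finding methods typically score pairs against how often you’d expect them by chance rather than by frequency alone.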
Note also for comparing documents that the frequency of a word isn’t very indicative of a word’s prominence in a text, and if you have a profile (i.e. vector) of word frequencies for two documents, it’s not immediately obvious how you would compare profiles to arrive at whatever result you want. (Not to say there aren’t ways to do it, but that there are many ways to do it.)
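To make the "many ways to compare profiles" point concrete, here is one common choice, cosine similarity between two frequency vectors; it is a sketch of one option, not *the* right answer, and it deliberately ignores words missing from either profile:

```python
import math

def cosine_similarity(freq_a, freq_b):
    """Cosine of the angle between two word-frequency dicts
    (1.0 = identical direction, 0.0 = no shared words)."""
    shared = set(freq_a) & set(freq_b)
    dot = sum(freq_a[w] * freq_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Swapping in a different measure (Euclidean distance, KL divergence, or tf-idf weighting before comparing) gives genuinely different results, which is exactly the "many ways to do it" problem.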