Saturday, March 9, 2013

Distribution of terms in Twitter Data


Term distribution is a very interesting topic in data science; once we start exploring it, we find many fascinating observations. I tried plotting the distribution of terms in Twitter data, taking a set of 10,000 random tweets. I did not remove stop words or do any spell checking. In fact, I applied no technique that alters the data, because I wanted to get a feel for the distribution of actual Twitter data.
I plotted the terms on the X-axis in decreasing order of frequency, with the frequencies on the Y-axis, and got the graph below:

What this distribution means is that very few words, in fact very, very few, account for the highest frequencies. These are almost certainly the articles, conjunctions, and prepositions that are needed to construct English sentences.
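For readers who want to reproduce this kind of ranking, here is a minimal sketch in Python. It is not the code used for the graph above; the whitespace tokenization, the function names rank_frequency and head_share, and the three sample tweets are all assumptions made for illustration. It counts raw terms without altering them, sorts them by decreasing frequency, and reports how much of the text the top few terms cover.

```python
from collections import Counter


def rank_frequency(tweets):
    """Return (term, count) pairs sorted by decreasing count."""
    counts = Counter()
    for tweet in tweets:
        # Assumption: plain whitespace split, with no stop-word removal,
        # no spell checking and no case folding, so the data is unaltered.
        counts.update(tweet.split())
    return counts.most_common()


def head_share(ranked, k):
    """Fraction of all term occurrences covered by the k most frequent terms."""
    total = sum(count for _, count in ranked)
    if total == 0:
        return 0.0
    return sum(count for _, count in ranked[:k]) / total


# Hypothetical sample; the original experiment used 10,000 random tweets.
sample_tweets = [
    "the game is on",
    "the rain is back",
    "gr8 game 2nite",
]

ranked = rank_frequency(sample_tweets)
for rank, (term, count) in enumerate(ranked, start=1):
    print(rank, term, count)
print("share of top 3 terms:", head_share(ranked, 3))
```

Plotting the counts from ranked against their rank gives the kind of curve described above: a short, steep head followed by a long, flat tail.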
Twitter allows users to post only 140 characters. As a Twitter user (@KausalMalladi), I often find it difficult to fit my thoughts into that little space, and I am sure it is the same for everyone. On the flip side, because only 140 characters are allowed, we would expect most of the words to be meaningful and relevant. But the distribution, with its long tail, says otherwise. Why? I think it is because of the 140-character limit itself: people tend to write short forms of words, and spelling mistakes are quite common in any social data.
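One rough way to put a number on the long tail is to measure how many distinct terms occur exactly once. The sketch below is only an illustration under the same assumptions as before (whitespace tokenization, made-up sample tweets, and a hypothetical function name singleton_fraction); it shows how short forms and misspellings of the same word each end up as their own rare term.

```python
from collections import Counter


def singleton_fraction(tweets):
    """Fraction of distinct terms that occur exactly once in the tweets."""
    # Assumption: plain whitespace split, no normalization of any kind.
    counts = Counter(term for tweet in tweets for term in tweet.split())
    if not counts:
        return 0.0
    singles = sum(1 for count in counts.values() if count == 1)
    return singles / len(counts)


# Hypothetical sample: "tomorrow", "tmrw" and "2moro" are the same word
# to a reader, but three separate terms to the counter.
sample_tweets = [
    "see you tomorrow",
    "c u tmrw",
    "see u 2moro",
]

print(singleton_fraction(sample_tweets))
```

A high value here means most of the vocabulary sits in the tail, which is what we would expect if abbreviations and spelling variants are splitting one word into many terms.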
The interesting part of the observation is that, although we are restricted to 140 characters per post and would therefore expect every word to count, the distribution shows that this is not the case. Maybe we would get a better term distribution with spelling correction applied to the terms; I will try that and post the results.

This post is also published on my other blog.