Term distribution is a very interesting topic in Data Science: once
we start exploring it, we find many fascinating observations. I tried
plotting the distribution of terms in Twitter data, using a set of
10,000 random tweets. I did not remove stop words or do any spell
checking. In fact, I applied no technique that alters the data,
because I just wanted to get a feel for the distribution of actual
Twitter data.
I plotted the terms on the X-axis in decreasing order of frequency,
with the frequencies on the Y-axis. I got the graph below:
What this distribution means is that very few words, in fact very,
very few, occur most frequently. These are most likely the articles,
conjunctions, and prepositions that are needed to construct English
sentences.
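The counting and ranking described above can be sketched roughly as follows. This is only an illustration, not the exact script behind the plot: the whitespace tokenization, the function names `term_counts` and `top_k_share`, and the three sample tweets are all assumptions made for the example. It counts raw terms without altering them, ranks them by decreasing frequency, and measures how much of the text the top few terms account for.

```python
from collections import Counter


def term_counts(tweets):
    """Count raw terms across tweets (hypothetical helper)."""
    counts = Counter()
    for tweet in tweets:
        # Assumption: plain whitespace split. No stop-word removal,
        # no spell checking, no lowercasing, so the data stays unaltered.
        counts.update(tweet.split())
    return counts


def top_k_share(counts, k):
    """Fraction of all term occurrences covered by the k most frequent terms."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(count for _, count in counts.most_common(k)) / total


# Made-up sample standing in for the real set of tweets.
sample_tweets = [
    "the cat and the dog",
    "the dog is gr8",
    "a cat is a cat",
]

counts = term_counts(sample_tweets)
# (term, frequency) pairs in decreasing order of frequency:
# the rank gives the X-axis position, the frequency the Y-axis value.
ranked = counts.most_common()
share = top_k_share(counts, 2)

print(ranked)
print(share)
```

On real tweets, the frequencies in `ranked` can be passed to any plotting library, and a large `top_k_share` for a small `k` is what "very few words occur most frequently" looks like in numbers.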
Twitter allows users to post only 140 characters. As a
Twitter user (@KausalMalladi),
I often find it difficult to fit my thoughts into that little space,
and I am sure it is the same for everyone. On the flip side, because
only 140 characters are allowed, we would expect most of the words to
be meaningful and relevant. But the distribution says otherwise: it
has a long tail of terms that occur only rarely. Why? I think it is
because of the 140-character limit: people tend to write short forms
of words, and spelling mistakes are quite common in any social data,
so the same word can show up as several different rare terms.
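One rough way to put a number on the long tail is the fraction of distinct terms that occur exactly once. The sketch below is an assumption about how one might measure it, not something taken from the original analysis; the helper name `singleton_fraction` and the toy term list are hypothetical.

```python
from collections import Counter


def singleton_fraction(counts):
    """Fraction of distinct terms that occur exactly once (hypothetical helper)."""
    if not counts:
        return 0.0
    singles = sum(1 for count in counts.values() if count == 1)
    return singles / len(counts)


# Toy data: "gr8" and "grt" are short forms of "great", and "u" of "you".
# Each variant is counted as a separate term and lands in the tail.
toy_counts = Counter("the the the gr8 grt great great u you you".split())
tail = singleton_fraction(toy_counts)

print(tail)
```

If short forms and misspellings are what drive the tail, this fraction should drop once variants are mapped back to a single spelling.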
The interesting part of the observation is that, although we are
restricted to 140 characters per post, in which case every word would
be expected to carry meaning, the distribution does not actually show
that. Maybe we would get a better term distribution after spelling
correction on the terms. I will try that and post the results.