Thursday, May 30, 2013

Performance Analysis of a Few Multi-dimensional Index Structures


A multi-dimensional index structure is an index structure built to work on data points in a multi-dimensional space. In document vector notation, each term of the document corpus is a dimension, and every document is represented as a point in that multi-dimensional space. Retrieving information from such a space requires specialized index structures. The article published on my other blog, dedicated to Information Retrieval and Machine Learning, introduces the broad categories of multi-dimensional index structures, discusses a few of them, and finally analyzes their performance.
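To make the document vector idea concrete, here is a minimal sketch in Python. It is only an illustration, not code from the linked article: the helper names `build_vocabulary` and `to_vector`, the toy corpus, the lowercase-and-split tokenization, and the use of raw term counts as coordinates are all my assumptions.

```python
from collections import Counter


def build_vocabulary(corpus):
    """Return the sorted distinct terms; each term is one dimension."""
    # Assumption: lowercase + whitespace split is a good enough tokenizer here.
    return sorted({term for doc in corpus for term in doc.lower().split()})


def to_vector(doc, vocabulary):
    """Map a document to a point with one term-count coordinate per dimension."""
    counts = Counter(doc.lower().split())
    # Counter returns 0 for terms that do not occur in this document.
    return [counts[term] for term in vocabulary]


# Toy corpus, made up for illustration only.
corpus = [
    "index structures speed up retrieval",
    "retrieval from an index",
]
vocabulary = build_vocabulary(corpus)
vectors = [to_vector(doc, vocabulary) for doc in corpus]

for doc, vec in zip(corpus, vectors):
    print(vec, "<-", doc)
```

Even this tiny corpus produces a seven-dimensional space, and a real corpus has one dimension per distinct term, which is why ordinary one-dimensional indexes do not help much here.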

Have a read and post your comments here or on my other blog.

Saturday, March 9, 2013

Distribution of terms in Twitter Data


Term distribution is a very interesting topic in Data Science; once we start exploring it, we find many fascinating observations. I plotted the distribution of terms in Twitter data, taking a set of 10,000 random tweets. I did not remove stop words or do any spell checking. In fact, I applied no technique that alters the data, just to get a feel for the distribution of actual Twitter data.
I plotted the terms on the X-axis in decreasing order of frequency, with the frequencies on the Y-axis. I got the graph below:

What this distribution means is that very few words, in fact very, very few, occur most frequently. Without any doubt, these would be the articles, conjunctions, and prepositions that are required for constructing English sentences.
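For anyone who wants to reproduce this kind of rank-frequency view, here is a minimal Python sketch. It is not the exact code I used: the function names `rank_frequency` and `tail_share`, the plain whitespace tokenization, and the three sample strings are assumptions made for illustration. In keeping with the experiment, it does no stop-word removal and no spelling correction.

```python
from collections import Counter


def rank_frequency(tweets):
    """Return (term, frequency) pairs sorted by decreasing frequency."""
    counts = Counter()
    for tweet in tweets:
        # Assumption: split on whitespace only, leaving the text unaltered.
        counts.update(tweet.split())
    return counts.most_common()


def tail_share(ranked, threshold=1):
    """Fraction of distinct terms that occur at most `threshold` times."""
    if not ranked:
        return 0.0
    rare = sum(1 for _, freq in ranked if freq <= threshold)
    return rare / len(ranked)


# Made-up sample standing in for the real tweets.
sample = ["the cat sat on the mat", "the dog sat", "gr8 day 2day"]
ranked = rank_frequency(sample)

# Rank goes on the X-axis, frequency on the Y-axis.
for rank, (term, freq) in enumerate(ranked, start=1):
    print(rank, term, freq)
print("share of terms seen only once:", tail_share(ranked))
```

The list returned by `rank_frequency` is already in plotting order, so the frequencies can be passed to any plotting tool as they are. `tail_share` gives a rough single number for how heavy the long tail is.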
Twitter allows users to post only 140 characters. As a Twitter user (@KausalMalladi), I often find it difficult to fit my thoughts into that little space, and I am sure it is the same for everyone. On the flip side, because only 140 characters are allowed, we would expect most of the words to be meaningful and relevant. But the distribution says otherwise: it has a long tail. Why? I think it is because of the 140-character limit itself: people tend to write short forms of words, and spelling mistakes are quite common in any social data.
The interesting part of the observation is that, although posts are restricted to 140 characters, in which case we would expect every word to carry meaning, the distribution does not bear that out. Maybe we would get a better term distribution with spelling corrections applied to the terms; I will try that and post the results.

The same post is also published on my other blog.