Sunday, October 30, 2011


An interesting point I noticed while analyzing the results of project part 2.

Pages are expected to be ranked by the following equation: W*PageRank + (1-W)*TF/IDFSimilarity. Here PageRank is the probability of importance of a page measured against the entire document corpus (including irrelevant documents), while TF/IDF similarity is a similarity measure for a page within the cluster formed by the relevant documents. Because the two scores are defined over different sets of documents, W has to be chosen carefully. If it is not, some irrelevant pages with a high PageRank end up ranked above relevant pages.
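To make the effect of W concrete, here is a minimal Python sketch of the combined score. The two pages and their scores are hypothetical values made up for illustration (they are not results from the project), and both scores are assumed to be scaled to the range 0 to 1. The function names are also my own.

```python
def combined_score(w, pagerank, similarity):
    """Weighted sum from the post: W*PageRank + (1-W)*TF/IDF similarity."""
    return w * pagerank + (1 - w) * similarity


def rank_pages(pages, w):
    """Return page names sorted by combined score, best first."""
    return sorted(
        pages,
        key=lambda name: combined_score(w, *pages[name]),
        reverse=True,
    )


# Hypothetical (pagerank, tfidf_similarity) pairs, both assumed to lie in [0, 1].
PAGES = {
    "popular_but_irrelevant": (0.90, 0.05),
    "relevant_but_obscure": (0.10, 0.80),
}

# Compare the ordering for a PageRank-heavy and a similarity-heavy weight.
for w in (0.9, 0.2):
    print(w, rank_pages(PAGES, w))
```

With these made-up numbers, a weight that leans heavily on PageRank puts the irrelevant page first, while a weight that leans on TF/IDF similarity puts the relevant page first.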