Tweet Notes (CSE 494/598 F11): 09/08/2011

Tuesday, September 13, 2011

Retrieval

Find the weight (TF-IDF) of each term in every document. Treating each term as a dimension in a vector space, the weights form a vector for every document.
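A minimal sketch of the weighting step, assuming the common raw-count variant weight = tf * log(N / df); the notes do not say which TF-IDF variant was used in class. The function name tfidf_vectors, the whitespace tokenizer, and the sample documents are illustrative only.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per document.

    Assumed weighting: tf * log(N / df), where tf is the raw count of the
    term in the document, N the number of documents, and df the number of
    documents that contain the term.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: count each term once per document.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

# Made-up sample documents.
docs = [
    "information retrieval ranks documents",
    "vector space retrieval",
    "inverted index",
]
vecs = tfidf_vectors(docs)
```

Each vector stores only the terms that occur in its document, so every other dimension is implicitly 0.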
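Then compute the cosine similarity between each document vector d and the query vector q using the dot product: sim(d, q) = (d . q) / (|d| |q|). Since we only need to rank documents by similarity, we can save a division by dropping |q|, which is the same for every document.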
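The following sketch illustrates why dropping |q| is safe: the full cosine and the score without |q| produce the same ordering. Vectors are assumed to be sparse {term: weight} dicts; the helper names and the sample weights are made up for illustration.

```python
import math

def dot(a, b):
    """Dot product of two sparse {term: weight} vectors."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def norm(v):
    """Euclidean length of a sparse vector."""
    return math.sqrt(sum(w * w for w in v.values()))

def cosine(d, q):
    """Full cosine similarity: (d . q) / (|d| |q|)."""
    denom = norm(d) * norm(q)
    return dot(d, q) / denom if denom else 0.0

def rank_score(d, q):
    """Ranking-only score: (d . q) / |d|, with the constant |q| dropped."""
    n = norm(d)
    return dot(d, q) / n if n else 0.0

# Made-up document and query weights.
doc_vectors = {
    "d1": {"vector": 1.0, "space": 2.0},
    "d2": {"vector": 3.0, "index": 4.0},
    "d3": {"index": 1.0},
}
query = {"vector": 1.0, "space": 1.0}

by_cosine = sorted(doc_vectors, key=lambda k: cosine(doc_vectors[k], query), reverse=True)
by_rank_score = sorted(doc_vectors, key=lambda k: rank_score(doc_vectors[k], query), reverse=True)
```

Dividing every score by the same positive constant |q| does not change the order, so both lists rank the documents identically.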
Because the vectors are sparse, we can be query-centric and build an inverted index that maps each word to the IDs of the documents containing it, along with the number of occurrences in each document.
We can further improve the inverted index by also storing the positions of the word within each document (an occurrence index).
The inverted index lets us avoid looking at documents in which the query terms have weight 0.
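The index structure described above could be sketched roughly as follows. This assumes whitespace tokenization and an in-memory dict; the names build_positional_index and candidate_docs and the sample documents are hypothetical, not from the lecture.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id: [positions]}.

    The occurrence count of a term in a document is the length of its
    position list, so one structure covers both counts and positions.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def candidate_docs(index, query_terms):
    """Return only the documents that contain at least one query term.

    Every other document has weight 0 on all query dimensions and is
    never visited.
    """
    candidates = set()
    for term in query_terms:
        candidates.update(index.get(term, {}))
    return candidates

# Made-up sample documents keyed by document ID.
docs = {
    1: "vector space model",
    2: "inverted index of terms",
    3: "vector index",
}
index = build_positional_index(docs)
```

For example, candidate_docs(index, ["space"]) touches only document 1, so similarity needs to be computed for that document alone.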

Aneeth