Tuesday, September 13, 2011

09/08/2011

Retrieval
  1. Find the weight (TF-IDF) of each term in every document. Treating each term as a dimension of a vector space, the weights form a vector for every document.
  2. Then compute the cosine similarity between each document vector d and the query vector q: (d . q) / (|d| |q|). Since we only rank documents by similarity, we can save a division by leaving out |q|, which is the same for every document.
  3. As the vectors are sparse, we can be query-centric and build an inverted index that maps each word to the IDs of the documents containing it, along with the number of occurrences of the word in each document.
  4. We can further improve the inverted index by also storing the positions of the word within each document (occurrence index).
  5. Going through the inverted index avoids looking at documents in which the query terms have weight 0.
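
The steps above can be sketched in a few lines of Python. This is only an illustration under some assumptions that the notes do not specify: TF is taken as the raw term count, IDF as log(N / df), and documents are already split into tokens. The function names build_index and rank are made up for this example.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Build a positional inverted index from {doc_id: [tokens]}.

    Returns (index, idf, norms), where index maps
    term -> {doc_id: [positions]}, idf maps term -> log(N / df),
    and norms maps doc_id -> |d| under TF-IDF weights.
    """
    n_docs = len(docs)
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for pos, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(pos)

    # Assumed IDF variant: log(N / number of documents containing the term).
    idf = {term: math.log(n_docs / len(postings))
           for term, postings in index.items()}

    # Accumulate squared TF-IDF weights per document to get |d|.
    squared = defaultdict(float)
    for term, postings in index.items():
        for doc_id, positions in postings.items():
            weight = len(positions) * idf[term]
            squared[doc_id] += weight * weight
    norms = {doc_id: math.sqrt(squared[doc_id]) for doc_id in docs}
    return index, idf, norms


def rank(query_tokens, index, idf, norms):
    """Rank documents by (d . q) / |d|; |q| is left out on purpose."""
    scores = defaultdict(float)
    for term, q_tf in Counter(query_tokens).items():
        if term not in index:
            continue
        q_weight = q_tf * idf[term]
        # Only documents listed under the query term are ever touched,
        # so documents where the term has weight 0 are never visited.
        for doc_id, positions in index[term].items():
            scores[doc_id] += q_weight * len(positions) * idf[term]
    ranked = [(doc_id, s / norms[doc_id])
              for doc_id, s in scores.items() if s > 0]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog sat".split(),
    "d3": "a bird flew".split(),
}
index, idf, norms = build_index(docs)
results = rank("cat mat".split(), index, idf, norms)
```

Because the score is divided only by |d|, the values are not true cosines, but the ordering of the documents is the same as it would be with the full formula.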
 
Aneeth