Retrieval
- Compute a TF-IDF weight for each term in every document. Treating each term as a dimension of a vector space, the weights form one vector per document.
- Then compute the cosine similarity between each document vector d and the query vector q from their dot product: (d . q) / (|d| |q|). Since we only rank documents by similarity and |q| is the same for every document, we can skip the division by |q|.
- Because the term vectors are sparse, we can be query-centric and build an inverted index that maps each word to the IDs of the documents containing it, along with the number of occurrences in each document.
- We can further improve the inverted index by also storing the positions of the word within each document (an occurrence index).
- Looking documents up through the index avoids visiting any document whose weight for the query terms is 0.
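
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a reference implementation: tokenization is a plain whitespace split, the TF-IDF weight is taken as raw term count times log(N / df), and the names build_index and rank are hypothetical. The index stores word positions, so the occurrence count is simply the length of each position list.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Build a positional inverted index, IDF table and document norms.

    docs maps doc_id -> text. Assumes whitespace tokenization and
    weight = term count * log(N / document frequency).
    """
    n_docs = len(docs)
    postings = defaultdict(dict)  # term -> {doc_id: [positions]}
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            postings[term].setdefault(doc_id, []).append(pos)

    idf = {t: math.log(n_docs / len(plist)) for t, plist in postings.items()}

    # |d| for every document, accumulated term by term from the index.
    squared = defaultdict(float)
    for term, plist in postings.items():
        for doc_id, positions in plist.items():
            squared[doc_id] += (len(positions) * idf[term]) ** 2
    norms = {doc_id: math.sqrt(v) for doc_id, v in squared.items()}
    return postings, idf, norms


def rank(query, postings, idf, norms):
    """Rank documents by (d . q) / |d|; |q| is dropped as it is constant."""
    scores = defaultdict(float)
    for term, q_count in Counter(query.lower().split()).items():
        # Only documents listed under a query term are ever touched.
        for doc_id, positions in postings.get(term, {}).items():
            d_weight = len(positions) * idf[term]
            q_weight = q_count * idf[term]
            scores[doc_id] += d_weight * q_weight
    ranked = [
        (doc_id, score / norms[doc_id])
        for doc_id, score in scores.items()
        if norms.get(doc_id, 0.0) > 0.0
    ]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


docs = {
    1: "the cat sat on the mat",
    2: "the dog sat on the log",
    3: "cats and dogs",
}
postings, idf, norms = build_index(docs)
results = rank("sat dog", postings, idf, norms)
```

Here the query "sat dog" only ever visits documents 1 and 2, because document 3 appears in neither posting list. Dividing by |d| alone changes the scores but not their order.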
Aneeth