Thursday, September 8, 2011

09/08/2011

A document-term matrix is a d x t matrix where:
d - the number of documents, each represented as a row vector.
t - the number of terms (words), each represented as a column vector.
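
As a rough illustration of the definition above, here is a minimal sketch that builds such a matrix from raw term counts. The three toy documents, the whitespace tokenization, and the function name build_document_term_matrix are assumptions made for this example, not part of the original notes.

```python
def build_document_term_matrix(docs):
    """Return (vocab, matrix) where matrix[i][j] counts term j in document i."""
    # Vocabulary: every distinct word across all documents, in sorted order.
    vocab = sorted({word for doc in docs for word in doc.lower().split()})
    column = {term: j for j, term in enumerate(vocab)}

    # One row per document, one column per term, initialised to zero.
    matrix = [[0] * len(vocab) for _ in docs]
    for i, doc in enumerate(docs):
        for word in doc.lower().split():
            matrix[i][column[word]] += 1
    return vocab, matrix


# Hypothetical toy corpus (d = 3 documents).
docs = ["the cat sat", "the dog barked", "cat and dog"]
vocab, matrix = build_document_term_matrix(docs)

for row in matrix:
    print(row)
```

Even in this tiny corpus, half of the entries in each row are zero, since each document uses only three of the six terms.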

Since not every word appears in every document, most entries of this matrix are zero, so it is sparse.
To rank documents, the cosine similarity between each document vector and the query vector is calculated. The disadvantage of this approach is that many documents are not similar to the query at all, because they don't contain any of the query terms. For those documents cos theta is zero, yet it is still computed for every row of the matrix.
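
The wasted work can be seen in a small sketch. The term columns, the count vectors, and the query below are made-up values chosen for illustration; the cosine function is the standard dot product divided by the product of the vector lengths.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_u * norm_v)


# Hypothetical term columns: cat, dog, fish, bird
doc_vectors = [
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 2],
    [0, 0, 1, 1],
]
query = [1, 0, 0, 0]  # the query contains only "cat"

# Brute force: one cosine computation per document, relevant or not.
scores = [cosine(row, query) for row in doc_vectors]
print(scores)
```

Only the first document shares a term with the query, so the other three computations all return zero.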

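This issue can be resolved by using an inverted index, which maps each term to the documents that contain it. Only documents sharing at least one term with the query are then selected for scoring.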
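
A minimal sketch of the idea, assuming whitespace tokenization and a toy corpus. The names build_inverted_index and candidate_documents are hypothetical helpers written for this example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def candidate_documents(index, query):
    """Return ids of documents containing at least one query term."""
    candidates = set()
    for term in query.lower().split():
        # .get avoids creating empty entries for unseen terms.
        candidates |= index.get(term, set())
    return sorted(candidates)


# Hypothetical toy corpus.
docs = ["the cat sat", "the dog barked", "cat and dog", "a bird sang"]
index = build_inverted_index(docs)

candidates = candidate_documents(index, "cat")
print(candidates)
```

Cosine similarity would then be computed only for the candidate documents instead of for all four.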

-Rashmi.