Monday, September 5, 2011

09/01/2011

Desiderata for Similarity Metrics

- Partial matches should be allowed - A query may be overstated, but that does not mean results should be returned only when every word matches. The system should be able to match on the important words in the query.
- Weighted matches should be allowed - Common words in the query should be given less weight than rare words. How rare a term is can be estimated from its document frequency, i.e. the number of documents in the collection that contain it.

- Relevance should not depend on size - Two documents may both match the words in the query, and one may have a higher term frequency than the other. That alone does not make it the more relevant one: a document can double its size, and with it its term frequencies, just by duplicating its content.
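
The three points above can be illustrated with a small sketch. It assumes TF-IDF weights with cosine similarity, which is one common way to meet these desiderata and not necessarily the metric from the lecture. The function names, the whitespace tokenizer, and the toy documents are all made up for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def idf_weights(docs):
    # Document frequency: number of documents that contain each term.
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    # Rare terms get a large weight; a term found in every document gets 0.
    return {term: math.log(n / count) for term, count in df.items()}


def tfidf_vector(text, idf):
    # Term frequency times IDF; terms unseen in the collection are dropped.
    tf = Counter(tokenize(text))
    return {term: count * idf[term] for term, count in tf.items() if term in idf}


def cosine(u, v):
    # Dividing by both vector lengths removes the effect of document size.
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "the cat sat on the mat",
    "the cat sat on the mat the cat sat on the mat",  # same content, duplicated
    "the dog chased the ball",
]
idf = idf_weights(docs)

# "sofa" matches no document, yet the query still scores the cat documents.
query_vector = tfidf_vector("the cat on the sofa", idf)
scores = [cosine(query_vector, tfidf_vector(doc, idf)) for doc in docs]
print(scores)
```

In this sketch the first document gets a positive score even though "sofa" never matches (partial match), the third document scores zero because its only shared word is "the", whose weight is zero (weighted match), and the duplicated second document scores the same as the first (size independence).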

-Bharath