Wednesday, September 7, 2011


The reason to use vector similarity is that linear algebra can help us find and build independent relationships between documents or queries.

We can convert a bag of terms into a vector with one dimension per term without changing its content; the two representations are almost the same. We can then use Euclidean distance to represent similarity: the greater the distance, the lower the similarity.
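
As a rough illustration of the bag-to-vector idea, here is a minimal sketch in Python. The sample documents and the helper names `to_vector` and `euclidean_distance` are made up for this example, and splitting on whitespace is an assumed, simplistic way to get terms.

```python
import math
from collections import Counter

def to_vector(text, vocabulary):
    """Turn a bag of terms into a count vector, one dimension per vocabulary term."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

def euclidean_distance(u, v):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical toy documents, chosen only for illustration.
docs = ["apple banana apple", "apple banana", "car engine car"]

# The vocabulary fixes the order of the dimensions.
vocabulary = sorted({term for doc in docs for term in doc.lower().split()})
vectors = [to_vector(doc, vocabulary) for doc in docs]

# Documents 0 and 1 share terms, so they end up closer together than 0 and 2.
print(euclidean_distance(vectors[0], vectors[1]))
print(euclidean_distance(vectors[0], vectors[2]))
```

The smaller distance between the first two documents matches the intuition that they are more similar to each other than to the third.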

The model assumes that all keywords in a document are independent of each other.

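If you don't normalize the vectors (for example, by dividing the dot product by the vector lengths), the Euclidean distances are not comparable across documents of different lengths.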
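
To make the normalization point concrete, here is a small sketch with made-up count vectors. The names `short_doc`, `long_doc`, and `other_doc` are hypothetical; `long_doc` is simply `short_doc` with every count doubled, as if the same text were repeated twice. Dividing the dot product by the vector lengths gives the cosine similarity, which is one common form of normalization.

```python
import math

def dot(u, v):
    """Dot product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    """Euclidean length of a vector."""
    return math.sqrt(dot(u, u))

def cosine_similarity(u, v):
    """Dot product divided by both lengths (assumes neither vector is all zeros)."""
    return dot(u, v) / (norm(u) * norm(v))

# Hypothetical term-count vectors over the same three-term vocabulary.
short_doc = [2, 1, 0]
long_doc = [4, 2, 0]   # same proportions as short_doc, twice as long
other_doc = [0, 1, 3]

# The raw dot product grows with document length, so it rates short_doc as
# more similar to long_doc than to itself.
print(dot(short_doc, short_doc), dot(short_doc, long_doc))

# After normalization, length no longer matters, only direction does.
print(cosine_similarity(short_doc, long_doc))
print(cosine_similarity(short_doc, other_doc))
```

With normalization, the short and long versions of the same text come out as essentially identical, while the unrelated vector scores much lower.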

If a word appears in every document, it is useless for telling documents apart, so we should try to figure out which features really represent a document.
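
One common way to act on this idea, though the note above does not name it, is inverse document frequency (IDF): weight each term by log(N / df), where N is the number of documents and df is the number of documents containing the term. The sketch below assumes that formula, whitespace tokenization, and a made-up three-document collection; the function name `inverse_document_frequency` is hypothetical.

```python
import math

def inverse_document_frequency(docs):
    """Map each term to log(N / df), where df counts documents containing the term."""
    n = len(docs)
    tokenized = [set(doc.lower().split()) for doc in docs]
    vocabulary = set().union(*tokenized)
    return {
        term: math.log(n / sum(term in tokens for tokens in tokenized))
        for term in vocabulary
    }

# Hypothetical toy collection; "the" appears in every document.
docs = ["the cat sat", "the dog ran", "the cat ran"]
idf = inverse_document_frequency(docs)

# A term found in every document gets weight log(1) = 0, so it contributes nothing.
print(idf["the"])
# Rarer terms get larger weights.
print(idf["cat"], idf["dog"])
```

Under this weighting, a word that occurs everywhere drops out, and the terms left with high weight are the ones more likely to characterize a document.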

-Shu Wang