Tuesday, September 6, 2011

9/6/2011

I found the difference between Euclidean and cosine similarity to be an interesting reminder of the artificiality of representing a document as a vector of word frequencies. Mathematically, I don't see any reason to prefer one over the other here, but because of the properties of the documents being represented, cosine happens to give a more intuitive result. I'm curious whether there could be some other method for computing distance that is tailored to how humans perceive similarity in these cases, and would thus give a better result than cosine.
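
To make the contrast concrete, here is a small Python sketch of what I have in mind. The three toy documents, the whitespace tokenizer, and the helper names (`term_frequencies`, `euclidean_distance`, `cosine_similarity`) are all my own assumptions for illustration, not anything from the reading. The long document is just the short one repeated ten times, so a human would call them "about the same thing," while the third document shares no words with either.

```python
import math
from collections import Counter


def term_frequencies(text, vocabulary):
    """Count how often each vocabulary word occurs in text (naive whitespace split)."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors; ignores their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy documents, chosen only to illustrate the effect.
short_doc = "cats chase mice"
long_doc = " ".join([short_doc] * 10)   # same content, ten times longer
other_doc = "stocks fell today"         # unrelated topic, same length as short_doc

vocabulary = sorted(set((short_doc + " " + other_doc).split()))
short_vec = term_frequencies(short_doc, vocabulary)
long_vec = term_frequencies(long_doc, vocabulary)
other_vec = term_frequencies(other_doc, vocabulary)

print("Euclidean short vs long: ", euclidean_distance(short_vec, long_vec))
print("Euclidean short vs other:", euclidean_distance(short_vec, other_vec))
print("Cosine    short vs long: ", cosine_similarity(short_vec, long_vec))
print("Cosine    short vs other:", cosine_similarity(short_vec, other_vec))
```

In this sketch, Euclidean distance puts the short document closer to the unrelated one than to its own ten-fold copy, simply because the copy's vector is longer. Cosine compares only the direction of the vectors, so it rates the short and long versions as essentially identical and the unrelated document as having nothing in common.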

Joseph Junker