Tuesday, September 20, 2011

LSI can uncover hidden similarities that ordinary cosine similarity would miss. Using our familiar 10-document, 6-term document-term matrix, suppose we have two documents D1 and D2, where D1 contains the term "regression" many times and none of the other terms, and D2 contains the term "likelihood" many times and none of the other terms. Intuition says we ought to expect these documents to be similar (the cosine clusters from a previous lecture also support this), but when we actually compute the cosine similarity it is 0, because the documents share no terms and so their dot product is 0. LSI, on the other hand, exploits the fact that "regression" and "likelihood" co-occur in other documents: the SVD maps both terms onto the same latent dimension, so D1 and D2 end up close together in the reduced space and are revealed to be similar after all.
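
As a concrete illustration, here is a minimal sketch with a small made-up term-document count matrix (not the lecture's 10-document, 6-term example): D1 uses only "regression", D2 uses only "likelihood", two background documents contain both terms, and one unrelated document is about SVMs. The raw cosine between D1 and D2 comes out 0, while the rank-2 LSI cosine is essentially 1.

```python
# Minimal sketch: a made-up 5-term x 5-document count matrix (NOT the
# lecture's matrix).  D1 uses only "regression", D2 only "likelihood",
# D3/D4 are background documents where the two terms co-occur, and D5
# is an unrelated document about SVMs.
import numpy as np

terms = ["regression", "likelihood", "kernel", "svm", "margin"]
#              D1  D2  D3  D4  D5
A = np.array([
    [10,  0,  6,  5,  0],   # regression
    [ 0, 10,  6,  5,  0],   # likelihood
    [ 0,  0,  0,  0,  8],   # kernel
    [ 0,  0,  0,  0,  9],   # svm
    [ 0,  0,  0,  0,  7],   # margin
], dtype=float)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

d1, d2 = A[:, 0], A[:, 1]
print("raw cosine(D1, D2):", cosine(d1, d2))   # 0.0 -- no shared terms

# Rank-2 LSI: A ~= U_k S_k V_k^T.  Document j's coordinates in the
# latent space are column j of S_k V_k^T.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
docs_lsi = (np.diag(s[:k]) @ Vt[:k, :]).T      # one row per document
print("LSI cosine(D1, D2):", cosine(docs_lsi[0], docs_lsi[1]))  # ~1.0
```

Because "regression" and "likelihood" co-occur in D3 and D4, the top singular direction loads on both terms together, so D1 and D2 project onto the same latent dimension even though they share no terms.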

Andree