Wednesday, September 21, 2011

Latent Semantic Indexing (LSI) allows us to capture correlations
between words that we previously did not know about. LSI is
accomplished through use of SVD. A number of the smallest singular
values are set to zero, and then the decomposition is re-multiplied.
This produces a lower-dimensional approximation that captures the most
variance possible in those dimensions. This lower-dimensional
approximation causes correlated documents to become "clustered"
together. A simple way of finding query similarity in this space is to
add the query as a pseudo-document to the original document collection
before decomposition, and then use the distance to the other real
documents after reduction.
Alternatively, you can transform the query into the LSI space directly
by multiplying it by the term-factor matrix (Q*TF) and comparing the
result against the document coordinates (DF*FF). - Thomas Hayden
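
To make the steps above concrete, here is a minimal sketch in Python
with NumPy. It is not from the original post: the toy term-document
matrix, the choice of k = 2, and the helper names (lsi, fold_in,
cosine) are all assumptions made for illustration. Reading TF as the
truncated left singular vectors, FF as the diagonal matrix of kept
singular values, and DF as the truncated right singular vectors is
also an assumption about the notation.

```python
import numpy as np

# Hypothetical toy term-document matrix (rows = terms, columns = documents).
# Docs 0-2 are about cars, docs 3-4 are about flowers.
terms = ["car", "auto", "engine", "flower", "petal"]
A = np.array([
    [1, 0, 1, 0, 0],  # car
    [0, 1, 1, 0, 0],  # auto
    [1, 1, 1, 0, 0],  # engine
    [0, 0, 0, 1, 1],  # flower
    [0, 0, 0, 1, 1],  # petal
], dtype=float)


def lsi(matrix, k):
    """Return the rank-k SVD factors, keeping the k largest singular values."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]


def fold_in(query, U_k):
    """Project a term-space query vector into the k-dimensional space (Q*TF)."""
    return query @ U_k


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


k = 2
U_k, s_k, Vt_k = lsi(A, k)

# Re-multiplying the truncated factors gives the rank-k approximation.
A_k = U_k @ np.diag(s_k) @ Vt_k

# Document coordinates in the reduced space, one row per document (DF*FF).
doc_coords = Vt_k.T @ np.diag(s_k)

# Query containing only the term "car".
q = np.array([1, 0, 0, 0, 0], dtype=float)
q_k = fold_in(q, U_k)

raw_sims = [cosine(q, A[:, j]) for j in range(A.shape[1])]
lsi_sims = [cosine(q_k, doc_coords[j]) for j in range(A.shape[1])]

for j in range(A.shape[1]):
    print(f"doc {j}: raw cosine = {raw_sims[j]:.3f}, LSI cosine = {lsi_sims[j]:.3f}")
```

In this toy data, doc 1 contains "auto" and "engine" but not "car", so
its raw cosine with the query is zero. After reduction it scores as
similar to the query, because "car" co-occurs with the other two terms
elsewhere in the collection, while the flower documents stay
dissimilar.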