Friday, September 30, 2011

9/29/11

Importance computation can be made very query-specific by first
retrieving the top k results (e.g. using a vector space ranking
method), and then retrieving the pages that the results point forward
to (authorities) and also the pages that link back to them (hubs). In
its entirety, this graph of links is called the "base set". Additional
value can be extracted by also considering the other pages (new
authorities) that the hubs point to, and the other pages (new hubs)
that point to the authorities.

Andree

9/29/11

Authorities and Hubs cannot be directly compared to PageRank. While they may produce similar rankings of pages, this is not usually the case.

-James Cotter

09/29/2011

There are two ways of combining similarity with importance:
Global, where the importance is pre-computed before the query is submitted.
Query specific, which is slow since the importance is computed at query time.

--
Ivan Zhou
Graduate Student
Graduate Professional Student Association (GPSA) Assembly Member
School of Computing, Informatics and Decision Systems Engineering
Ira A. Fulton School of Engineering
Arizona State University

9/29/2011

Importance computation can be done at three levels: global, query-specific and topic level. At the global level, each page has a single rank regardless of the query. At the query-specific level, each page has a rank specific to each query, and at the topic level, each page has a rank for each topic.

--Dinu John

09/29/11

-->PageRank is query-independent and provides a global importance measure for every page on the web

-->Random Walk Model (a small sketch follows below)
     -When there are no forward links, the web surfer jumps to any page on the Web at random with equal probability 1/N
     -When there are forward links, the web surfer follows one of them with probability c, or jumps to a page chosen at random with probability (1 − c)
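A minimal sketch of this random-surfer computation (a toy adjacency list and the same c / (1 − c) split as above; not the exact implementation from class):

import numpy as np

def pagerank(adj, c=0.85, iters=50):
    # adj[i] = list of pages that page i links to
    n = len(adj)
    r = np.ones(n) / n                          # start from a uniform distribution
    for _ in range(iters):
        new_r = np.zeros(n)
        for i, links in enumerate(adj):
            if links:
                for j in links:                 # follow a forward link with probability c
                    new_r[j] += c * r[i] / len(links)
                new_r += (1 - c) * r[i] / n     # random jump with probability 1 - c
            else:
                new_r += r[i] / n               # no forward links: jump anywhere with probability 1/N
        r = new_r
    return r

# e.g. a 3-page graph: page 0 -> 1, page 1 -> 2, page 2 has no out-links
print(pagerank([[1], [2], []]))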


-Aamir

Thursday, September 29, 2011

09/29/2011

To incorporate both importance and query similarity when ranking a document, the best approach is to cluster the documents in the corpus based on the topic/class they belong to and find the importance of each document with respect to its cluster. This can be pre-computed and stored for later use.

When the query arrives, assign a topic to the query, find documents based on similarity with respect to the query, and calculate the combined rank as:
PR = alpha * SimilarityValue + (1 - alpha) * Importance,
where alpha is any value between 0 and 1.

-Rashmi Dubey

09/29/2011

The stochastic matrix M* need not be symmetric and hence is not guaranteed to have real eigenvalues, but its primary eigenvalue will always be 1.

Big Authorities have large PageRank and Pure Hubs have small PageRank.

-Rashmi Dubey

09/29/2011

Query-specific importance computation might be more accurate than global importance computation,
but its major drawback is that the importance is calculated after the calculation of similarity.
This drastically increases the query time, which is not acceptable.

-Anwasha Dhar.

Re: 9/29/2011

if similarity > 0:
    score = w * similarity + (1 - w) * rank
else:
    score = 0

-Arjun


9/29/2011

Importance calculation can be done globally or query-specifically.
In the global method, A/H or PageRank is computed once on the entire corpus.
In the query-specific method, PageRank or A/H is computed only on the query results.

- Sandeep Gautham

Re: 9/29/2011

While both global and query-specific approaches to link analysis have
their pros and cons, we can use a combination of the two, a topic-
specific approach, to take some of the query into account while
not sacrificing performance. We do this by assigning the query to
topic(s) and using the topic weights in the calculation.
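A rough sketch of that combination (the topic weights and per-topic rank tables below are made-up placeholders, not the actual numbers or algorithm from class):

def topic_sensitive_rank(page, topic_weights, topic_ranks):
    # topic_weights: {topic: P(topic | query)}, summing to 1
    # topic_ranks:   {topic: {page: pre-computed importance of the page for that topic}}
    return sum(w * topic_ranks[t].get(page, 0.0)
               for t, w in topic_weights.items())

# e.g. a query judged 70% "sports" and 30% "politics"
weights = {"sports": 0.7, "politics": 0.3}
ranks = {"sports": {"p1": 0.4, "p2": 0.1}, "politics": {"p1": 0.05, "p2": 0.3}}
print(topic_sensitive_rank("p1", weights, ranks))   # 0.7*0.4 + 0.3*0.05 = 0.295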

- Stephen Booher (Sorry for the repost; I forgot to sign my last "tweet.")


9/29/2011

Similarity and importance can be combined at two times: before or after the query. Doing it before the query is like combining apples and oranges, but doing it after the query one can compute similarity and then importance based on the query. The tradeoffs correspond to speed and accuracy.

- Elias

9/29/2011

Pre-computing importance for each page is quick and simple to
implement, using HITS or Pagerank. Unfortunately, while the
president's homepage may be very important with the query "world
leaders", you wouldn't want him to show up when you search for
"computer science". While computing the importance based on the query
itself may be helpful in terms of accuracy, it will drastically slow
down the results.
-Kalin Jonas

9/27/2011

The HITS algorithm for ranking pages uses Hub and Authority scores to
rate each page. Hub scores are given to pages that point to other
pages with high authority scores. Authority scores are given to pages
that are pointed to by reliable hubs. Through iterations, these scores
increase for each page, then after several iterations, those pages
with high authority or high hub scores are presented. Presenting both
authorities and hubs increases the likelihood that one of the first
few results will be useful.
-Kalin Jonas

09/27/2011



Authorities are the pages that hubs point to.
Pages have both authority and hub scores.
Ranking should be done so that it involves both authority results and hub results.


-Rajshekar


09/22/2011

Information retrieval on the web has additional requirements,
such as the need to crawl and maintain an index. And there are
additional structural advantages, such as links, tags, metadata, and
social structure. For example, certain HTML tags provide more emphasis
to their contents than others do. The use of hyperlinking adds a new
level of associations between documents. The words around anchor tags
can be used to describe the document that is linked, and the links on
a page can in turn describe the page itself. A unique phrase can be used to
create a strong relationship to a page more easily than a common
phrase. Hyperlinks can be used to establish a community of trust.
-Thomas Hayden

9/27/2011

A stochastic matrix is a square matrix whose entries are nonnegative real numbers, with each column summing to 1.

-Sandeep Gautham

09/27/2011

The principal eigenvalue of a stochastic matrix is 1.

-Bharath

09/27/2011

For a small graph, we can verify the approach for calculating hubs and authorities by hand; for the web, we have to assume it is working.

Shubhendra

09/27/2011

Power iteration is the best way to generate eigenvectors.
How fast does it converge?
This depends on the eigen gap between the first and the second eigenvalue. If the eigen gap is larger, the values converge faster, and vice versa.
If the starting vector is orthogonal to the primary eigenvector, then performing power iteration to compute the eigenvector will not be of much use, as the dot product will be zero.
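A small numpy sketch of power iteration (the matrix and starting vector here are arbitrary examples):

import numpy as np

def power_iteration(M, iters=100):
    v = np.random.rand(M.shape[0])       # random start; must not be orthogonal to the primary eigenvector
    for _ in range(iters):
        v = M @ v
        v = v / np.linalg.norm(v)         # re-normalize so the values don't blow up
    return v

M = np.array([[2.0, 1.0], [1.0, 2.0]])    # eigenvalues 3 and 1, so the eigen gap is large
print(power_iteration(M))                 # converges quickly toward [0.707, 0.707]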

Archana

09/27/2011

Making a Markov chain safe (a small sketch follows below):
  • Link all components (set weak links between disconnected components)
  • No sinks - set each all-zero column in the matrix to 1/n, where n = number of nodes in the chain
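A minimal numpy sketch of both fixes, assuming a column-stochastic matrix as in class (toy example, not the exact notation from the lecture):

import numpy as np

def make_safe(M, c=0.85):
    # M: n x n matrix whose columns should sum to 1 (column j = out-probabilities of node j)
    n = M.shape[0]
    M = M.copy()
    sinks = M.sum(axis=0) == 0
    M[:, sinks] = 1.0 / n            # no sinks: all-zero columns become 1/n
    return c * M + (1 - c) / n       # weak links between all nodes

M = np.array([[0.0, 0.0],
              [1.0, 0.0]])           # node 1 (second column) is a sink
print(make_safe(M))                  # every column of the result sums to 1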
--Shaunak

Wednesday, September 28, 2011

09/27/2011

Today we studied the concept of hubs and authorities, as well as the
concept of a steady state. Hubs and authorities can often be seen at
work in the real world. For example, in social situations an authority
figure will often have many followers (hubs), which reinforces each
respective member's authority and hub status. The general concept of a
steady state is essentially any state that tends to remain
consistent over a period of time. In web IR, this tends to be applied
to graph or matrix structures which can be used to calculate this
state. -Thomas Hayden

09/27/2011

Hubs point to authorities. A hub's value is calculated as the sum of the authority scores of the pages it points to, and an authority's value is calculated as the sum of the hub scores of the pages pointing to it. Since these values keep increasing in every iteration, we normalize them to unit vectors.

--
Ivan Zhou
Graduate Professional Student Association (GPSA) Assembly Member

9/27/2011

I asked this question in class:

Stable state (hub value zero) and probability of spending more time in particular web page will be the factors to estimate page rank. Though, link analysis can estimate which pages are expected to be most visited, it is not easy to know how much time a user spends at any particular page.

One simple method I can think of is by analyzing how often (considering time gap) a user selects alternative links proposed by search engines after selecting one.

--bhaskar

09/27/2011

Link analysis is used to evaluate the quality (importance) of a page before displaying the results to the user. There are two ways to do it:
1. Authorities and Hubs - uses the L2 norm (iterative calculation of authority and hub node scores gives the primary eigenvector)
2. PageRank - uses the L1 norm (Markov chain concept; the graph may be changed to satisfy the sufficient conditions for a Markov chain).

-Rashmi Dubey

09/27/2011

Power Iteration is the fastest way of calculating primary eigen vectors and the rate of convergence depends on the eigen gap.

--   Avinash

9/28/2011

The authority score is the sum of the individual hub scores: A(p) = ∑h(q).
For each iteration, obtain the unit vectors of authority and hub.
The relative values are important, rather than the actual values.

Power iteration is a fast way to calculate the primary eigenvector.
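A small sketch of those updates (numpy, with a toy link matrix; an illustration rather than the exact class algorithm):

import numpy as np

def hits(L, iters=50):
    # L[i, j] = 1 if page i links to page j
    n = L.shape[0]
    a = np.ones(n)
    h = np.ones(n)
    for _ in range(iters):
        a = L.T @ h                  # a(p) = sum of hub scores of pages pointing to p
        h = L @ a                    # h(p) = sum of authority scores of pages p points to
        a /= np.linalg.norm(a)       # keep only the relative values (unit vectors)
        h /= np.linalg.norm(h)
    return a, h

L = np.array([[0, 1, 1],
              [0, 0, 1],
              [0, 0, 0]])
authorities, hubs = hits(L)
print(authorities, hubs)             # page 2 gets the largest authority score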

Srividya

Tuesday, September 27, 2011

9/27/2011

Steady state vector = primary eigen vector.

Preethi

9/27/2011

Authorities are the pages that hubs point to, and hubs are the pages that point to important authorities. A metaphor from Hollywood: a movie star is an authority and an agent is a hub.

--Dinu John

9/27/2011

Sufficient conditions for making a graph "safe":
1)remove sinkholes by setting any column of all 0s to all 1/n, where n is the number of nodes
2)remove disconnected components by making weak links to all other nodes.

-James Cotter

9/27/2011

In a probability (stochastic) matrix, the values in each column sum to 1.

Anuj
 

9/27/2011

Power Iteration is the fastest way of calculating primary eigen vectors.

----Abilash

9/27/2011

In the Authority/Hubs ranking, we can reach a steady state by only
caring about the relative values of the authority and hub scores, and
we get that relative value by dividing by the L2 norms.

9/27/2011

The necessary conditions for the existence of a unique steady state distribution: aperiodicity (the chain is not one big cycle) and irreducibility (each node can be reached from every other node with non-zero probability).

- Elias

9/27/2011

The importance of a page is the probability that a random web surfer will find themselves on the page.

- Elias

09/22/2011

Positive symmetric matrices
d-d or t-t should always have non-negative eigenvalues; they cannot be negative or complex, since they go under a square root (to produce singular values).

-Vaijan

09/22/2011

About eigenvectors
All eigenvectors (of a symmetric matrix) are orthogonal to each other, so the dot product of any two distinct eigenvectors is 0.

-Vaijan Hundlekar

What "latent" means in Latent Semantic Indexing

Latent means a hidden dimension (the x-y axes in 2-D), which is different from the original dimensions (the x'-y' axes in 2-D).

-Vaijan

09/22/2011

Facts about LSI
1. It captures synonymy, polysemy, and correlations.
2. It can't handle queries about terms that are absent ("find all docs NOT containing term X").
3. It captures only linear correlations.

--
Akshay Sumant

9/22/11

Food for thought: LSI does not capture any negative correlation.

M.

09/27/2011


Latent Semantic Indexing does everything:
- Noise reduction
- Correlation analysis
- Dimensionality reduction
 
--
Nikhil Pratap
Graduate student
Department of Computer Science
(Arizona State University)

9/22/2011

Web pages are mostly HTML documents, which allow the author of a web page to control the display of the page contents on the Web and to express emphasis on different parts of the page. We can make use of the tag information to improve the effectiveness of a search engine.

If the anchor text is unique enough, then even a few pages linking with that keyword will make sure the page comes up high, while for more common-place keywords you need a lot more links.
- Shu Wang


9/20/2011

Singular value decomposition (SVD) is a method for identifying and ordering the dimensions along which data points exhibit the most variation, and once we have identified where the most variation is, it's possible to find the best approximation of the original data points using fewer dimensions. Hence, SVD can be seen as a method for data reduction.

What makes SVD practical for NLP applications is that you can simply ignore variation below a particular threshold to massively reduce your data, while being assured that the main relationships of interest have been preserved.

When we map each document and query vector into a lower-dimensional space, we call it Latent Semantic Indexing (LSI).
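A tiny numpy sketch of that reduction (the document-term matrix below is made up):

import numpy as np

dt = np.array([[2., 0., 1.],          # toy doc-term matrix: rows = documents, columns = terms
               [1., 0., 0.],
               [0., 3., 1.]])

U, s, Vt = np.linalg.svd(dt, full_matrices=False)
k = 2                                 # keep only the top-k singular values / dimensions
dt_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(s)                              # singular values, largest first
print(dt_k)                           # best rank-k approximation of the original matrix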
- Shu Wang

Sunday, September 25, 2011

9/22/2011

Hubs and Authorities could be used to determine the importance of a webpage.
How many friends do you have? How many friends do they have?
 
If a lot of popular people are friends with you, you must be popular.
If a webpage has many inlinks from high authority pages, it is important.
--
Mohal


Saturday, September 24, 2011

09/22/2011

It is very difficult to fetch trustworthy data from the web merely by using correlation analysis. Link Analysis is the technique which helps in getting the trustworthy data wrt the query.

-Rashmi Dubey

09/22/2011

Latent Semantic Indexing handles only linear combinations.
Non-linear dimensionality reduction is far more time-consuming than LSI.

-Rashmi Dubey

09/20/2011

Latent Semantic Indexing is so called because the new factor space is latent (unknown) until the data is analyzed.

-Rashmi Dubey

09/22/2011

Anchor text has higher importance than the plain text of the page it's pointing to. And if the anchor text is very common, then it requires a lot more work & links with that name so the page has a higher rank.

-Ivan Zhou

--
Graduate Student
Graduate Professional Student Association (GPSA) Assembly Member
School of Computing, Informatics and Decision Systems Engineering
Ira A. Fulton School of Engineering
Arizona State University

Friday, September 23, 2011

09/22/2011

The importance of a webpage depends on the webpages that point to it and the pages it points to, in a recursive way.

-Sekhar

09/22/2011


Documents on the web are identified not only by the text they contain but also by 
the anchor text in other documents that refer to them. This anchor text becomes a part of the content of the document.
Thus in effect the content of the document can change even if the document itself is not modified.

--Vaibhav


09/22/2011

Anchor tags play an important role in page ranking. The anchored text can be more relevant to the query than the actual content of the page itself.

--Avinash

9/22/2011

Weights of terms are, in general, either pre-set or learned.


-----Abilash

9/23/2011

Interesting: Binging (googling in bing) the phrase “miserable failure” still shows George W. Bush White House web page as the second result.

-Jadiel

9/22/2011

The positive square roots of the eigenvalues are called singular values.

-Sandeep Gautham

Thursday, September 22, 2011

09/22/11

Web IR as opposed to Traditional IR
--> Pages on the web contain links to other pages
--> Possible to exploit the anchor text contained in links as an indication of the content of the web page being pointed to
--> Processing the collection involves gathering static pages and learning about dynamic pages.

-Aamir


9/22/2011

The web contains documents/pages that falsely advertise themselves as containing certain terms so that they will be retrieved. For example, anyone can create a lookalike home page of XYZ claiming to be the original webpage. So it is now a challenge for the retrieval engine to return not only relevant documents but also trustworthy information.

-- Dinu John

9/22/2011

A result of link analysis is that now links to a page can be considered content of the page linked to. If you direct many links to a page with the same anchor text, search engines using link analysis may pick up on this and correlate the page with the anchor text.

-James Cotter

9/22/2011

The importance of a term occurrence within a web page is also determined by the font size and whether the term occurs in the header, footer, or body of the page.

Page relevance is determined by
  •  link analysis
  •  the number of accesses the page gets
  •  the query
  •  the user
  •  stability of the link structure
  •  how hard the measure is to subvert
Preethi

9/22/2011


Link Analysis:
Links between pages are used to decide page importance.
Links to a page are given higher weight than the page's own content.
The effective content of a page can be changed just by creating links to that page.

-Rajasekhar

9/22/2011


Page importance and page trustworthiness should be a part of IR ranking.

Tag importance:
      Importance is based on the elements surrounding the text in the tags.
      e.g. big font is more important than small font.

So term frequency is weighted differently for different tags.

-Rajasekhar

09/22/2011

Use of tag information:

1) Tag information can be used to rank results. Text appearing in <head> (for example) might be considered more important than text appearing in the body or footer.

2) Anchor text (text contained in an HREF tag) plays an important role in the ranking of pages (the example used in class of using a very uncommon word as anchor text to point to a particular website).

 

--Shreejay

 

09/22/2011

A page is not ranked only based on its content but also based on which other pages (and how important they themselves are) reference it. Hence these two points are very important while ranking pages

1) Page importance should not change dramatically with respect to small changes in link structures.

2) Intentional changes to link structures should be taken care of.

 

--Shreejay

 

09/22/2011

Pages that are picked should not just be relevant but also trustworthy. Trustworthiness should be a part of page ranking. At the same time relevance of a page is different from importance of that page. Relevance and importance should be considered in deciding to show the resultant pages to the end user. The reason is users can decide the relevance of a page but the users would not know the correctness of that page. Also importance is understood to be a fixed point computation.

Archana

9/22/2011

Traditional IR cannot be applied directly to web pages, as web pages are
  • very voluminous
  • widely distributed
  • extremely dynamic
  • more structured
  • linked to each other, so the links between pages need to be considered
We need to establish the importance and trustworthiness of pages, which influences the ranking.
Tag information is used to determine importance, and the presence of links makes the content non-local.


Srividya





9/22/2011

The content of a web page is non-local. It is shared by the links that point to the page.
Anchor text is given higher weight than the page's own contents.

Ramya

9/22/2011

There are at least two ways to utilize the structure on the web in
ranking pages for a query. One way is to use the semantics of the page
markup to create weighted vectors based on the importance of the terms in
the page; another way is to use the anchor text of links pointing to
the page, which adds more terms to the page.

Stephen Booher

9/22/2011

Anchor text can be considered more important than the actual contents
of the page pointed to. The benefit is that a user can find a page by
describing it, but doesn't have to know the actual contents of the
page. The downside is that defamatory links can make a page show up
where it may not deserve it.
Kalin Jonas

9/22/2011

LSI is a linear analysis that effectively reduces dimensionality, reduces noise, exploits redundant data, and finds term correlations.

- Elias

9/20/2011

An analog to LSI is the Fourier transform, where the input signal is split up into the frequency domain, allowing dimension reduction to filter out unwanted frequencies. This is directly analogous to LSI, where dimensionality reduction allows us to see the correlation between terms and documents.

-James Cotter

9/20/2011

The SVD of a matrix d-t is given as:
d-t = d-f * f-f * (t-f)^T
d-f and t-f are orthonormal matrices.
f-f is a diagonal matrix; its diagonal elements are s1, s2, ..., sk, where si = (i-th eigenvalue of the d-d or t-t matrix)^(1/2).


Ramya

9/20/11

To reiterate, in implementing LSI we are looking for axes that capture the
maximum variance. The primary eigenvector captures
the maximum variance representing the most important axes/dimension.
The eigen vector associated with the next largest eigen value captures
the next greatest variance and represents an orthogonal axis.

M.

09/20/2011

Eigenvalues are guaranteed to be real and non-negative because d-d and t-t are positive symmetric matrices.

--Avinash 

09/20/2011

For an orthogonal matrix, transpose of matrix is equal to inverse of matrix.

Shubhendra

09/20/2011

We need to assume that the given data is corrupted, and that the data we get after dimensionality reduction is the real data.
The least important dimensions, i.e. the dimensions in which the variance is very small, are actually noise and hence can be discarded.

Kuhu

9/20/2011

positive symmetric matrices always have positive eigen values

-Arjun

Wednesday, September 21, 2011

9/20/2011


d-f * f-f gives the coordinates of documents and terms in the factor space.

Dimensionality reduction:

It is essentially a filtering of the signal or data.
One of the main uses of dimensionality reduction is to find clusters in the data.


-Rajasekhar

9/20/2011


- Eigenvectors are always orthogonal to each other.

The document-term matrix can be obtained as
                  d-t = d-f x f-f x (t-f)^T

where d-f represents the eigenvectors of d-d = (d-t) x (t-d),
t-f represents the eigenvectors of t-t = (d-t)^T x (d-t), the term-term correlation matrix,

and f-f contains the eigenvalues of either t-t or d-d.

-Rajasekhar

9/20/2011

LSI has no notion of labels or clusters, it tries to preserve the uniqueness of each data point.

-Anwasha

9/20/2011

The documents and terms are vectors in the space of factors. We use LSI to capture variance.
d-t = (d-f) * (f-f) * (t-f)^T. The rank of the matrix f-f determines the dimensions. The dimensions can be further reduced, which gives us better insight into the variance of the data and helps us understand the similarities. LSI captures the clusters in the data. It can capture intuitions that the cos(theta) similarity would never capture, such as synonymy and polysemy.

Srividya 



09/21/2011

Latent Semantic Indexing allows us to capture correlations between
words that we previously did not know about. LSI is accomplished through
the use of SVD. A number of the smallest eigenvalues are chosen to be
removed, and then the decomposition is re-multiplied. This produces a
lower-dimension approximation that captures the most variance possible
in those dimensions. This lower-dimension approximation causes
correlated documents to become "clustered" together. A simple way of
finding query similarity in this space is to add the query as a pseudo-
document to the original document corpus before decomposition, and
then use the distance to other real documents after reduction.
Alternatively, you can transform the query into the LSI space by
multiplying it with the t-f matrix (q x t-f). - Thomas Hayden

9/15/2011

Eigenvector decomposition can be thought of as fitting an ellipsoid to the data. The eigenvectors tell us about the new dimensions (independent orthonormal axes) and the eigenvalues convey the importance of these dimensions in accounting for the variance of the distribution.

Anuj
 

9/20/2011

LSI - dimensionality reduction:

We want to capture the maximum characteristics of a document, including the ones that are not obvious in the "vector in term space" representation, in a way that lets us classify the document by looking at it from the fewest angles.

--
Mohal 

09/20/2011

Eigenvalues of the correlation matrix are guaranteed to always be non-negative. And LSI captures similarity & correlation.

-Ivan Zhou

--
Ivan Zhou
Graduate Student
Graduate Professional Student Association (GPSA) Assembly Member
School of Computing, Informatics and Decision Systems Engineering
Ira A. Fulton School of Engineering
Arizona State University

Tuesday, September 20, 2011

9/20/2011

Dimensional reduction can be helpful for making sense of data with
many dimensions. Even when removing several dimensions, a large
portion of the original variance can be maintained. Some dimensions
must be retained for the cosine-theta distance measurement to be
useful.
Kalin Jonas

09/20/2011

Why don't we reduce the dimensionality of the data set to one dimension?
We use cosine similarity for computing the similarity value between data points, and since the dot product of two orthogonal vectors is zero, we do not reduce the dimensionality that far.

LSI is an indexing and retrieval method that uses SVD to identify patterns of terms in a collection of text. It overcomes synonymy and polysemy. It also subsumes correlation analysis.

Archana

9/20/2011

q*TF is used to convert query vector into LSI space, where TF matrix is a matrix which describes how x & y direction unit vectors get mapped to x' & y' space.

-Dinu John

9/20/11

LSI can uncover hidden similarities that the normal cosine-theta
similarity would miss. Given our familiar 10-document, 6-term d-t
matrix, suppose we have two documents D1 and D2, where D1 contains the
term "regression" a huge number of times and none of the other terms,
and D2 contains the term "likelihood" a huge number of times and none
of the other terms. Our intuition says that we ought to expect
similarity (also validated by the cosine clusters in a previous
lecture), but when we actually compute the cosine similarity, their
dot product is 0. LSI, on the other hand, will place the two documents
close together along the dimension associated with the "regression"
and "likelihood" terms, which shows that they are indeed similar.

Andree

9/20/2011

The most important dimension is the one which captures maximum variance.

- Sandeep Gautham

09/20/2011

SVD computation is of complexity O(km^2n+k'n^3) for m*n matrix. This takes a lot of time and hence it is not used in practice.

-Bharath

09/20/2011

The doc-doc (d-d) matrix and the term-term (t-t) matrix are both positive symmetric matrices whose eigenvalues will always be positive. This is why we do not see negative or imaginary eigenvalues in f-f.


--Shreejay

9/20/2011

Lower rank matrices are probably better because it will help reduce
noise. Someone has proven that the resulting d-f x f-f matrix will
always be the "best" (as in the most similar to the original) matrix
for its given rank.

Stephen Booher

9/20/2011

SVD in terms of IR is d-t = d-f x f-f x t-f' where d-f is the eigenvectors of d-d, f-f is the eigenvalues of d-d or t-t, and t-f' is the eigenvectors of t-t.

- Elias

09/15/2011

A term-term matrix should generally be a diagonal matrix if the terms are truly independent. If that is not the case, the off-diagonal entries can give us some insight into how closely correlated one term is to another particular term.


----Abilash

09/15/2011

Dimension reduction is related to a concept called feature selection, where we describe the data in terms of features and determine which of the features are important. Feature selection is easier than dimension reduction, because in dimension reduction we do not know in advance which of the terms are important, since all the terms are derived terms.

-Bharath

9/15/2011

A document is a vector in the space of terms, and a term is a vector in the space of documents. If you consider the term-term matrix, it should be a diagonal matrix if in fact the terms are truly independent.

Alternatively, we can also do this with query logs (people typing this word also type that word, so maybe it is relevant for you too).
Two terms are related if they have high co-occurrence in the documents.

Email messages are bags of addresses, so we can compute address correlation.

Amazon users are bags of purchases, so Amazon can do correlation between purchases: people who buy these things often also buy those things, would you like to buy them as well?
The benefit of term-term correlation is that if the terms are not independent, we can exploit their correlation.

- Shu

Monday, September 19, 2011

9/15/2011

Dimensionality reduction is a kind of compression technique.

- Sandeep Gautham

09/15/2011

Correlations are symmetric, which is confirmed by the ij-th term being equal to the ji-th term in the scalar cluster matrix.


--Shaunak

Sunday, September 18, 2011

09/15/2011


When we perform dimensionality reduction,
Variance retained = (sum of the eigenvalues of the top k dimensions) / (sum of the eigenvalues of all dimensions)
Fraction of variance lost = 1 - variance retained (calculated above)

Kuhu                                                      

09/15/2011

From the term-document matrix you compute the term-term (correlation) matrix, which you normalize to obtain a new matrix (association clusters) that tells you how strongly terms are correlated, with 1.0 as the maximum value. From this normalized correlation matrix, we can compute one last matrix, called scalar clusters, that shows transitive correlation. In class it was shown that the correlation numbers for the terms database, SQL, and index increased with the scalar clusters. When Gmail recommends people to add to an email, it's using some sort of correlation algorithm.
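A rough numpy sketch of those two steps; the doc-term matrix is a toy example, and the particular normalization c_uv / (c_uu + c_vv - c_uv) is the textbook association-cluster formula, so treat the details as assumptions:

import numpy as np

dt = np.array([[2., 1., 0.],                       # toy doc-term matrix
               [1., 1., 1.],
               [0., 2., 3.]])

tt = dt.T @ dt                                     # term-term (co-occurrence) matrix
d = np.diag(tt)
assoc = tt / (d[:, None] + d[None, :] - tt)        # association clusters: 1.0 on the diagonal

rows = assoc / np.linalg.norm(assoc, axis=1, keepdims=True)
scalar = rows @ rows.T                             # scalar clusters: dot products of association rows

print(assoc)
print(scalar)                                      # picks up the transitive (neighborhood) correlations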


--
Ivan Zhou

09/15/2011

Scalar clusters are used to construct a thesaurus. There can be a global thesaurus (GT) or a local thesaurus (LT). A GT is constructed using all the terms in all the documents in the corpus, whereas LT construction is query-specific and uses only the terms related to the query.
An LT is better than a GT in scenarios where a term has a specific meaning with respect to the query. For example, if we are looking for 'operating' with respect to computer science documents and construct a GT, we might pick up many other uses of the term 'operating', which will dilute its significance.

-Rashmi Dubey

09/15/2011

Association clustering (AC) helps us determine the correlation between neighbouring terms. AC lacks the transitive correlation property, i.e. if a term t1 is associated with term t2, and term t2 is correlated with term t3, then t1 should be related to t3.
To overcome this drawback of AC, scalar clustering (SC) is used. A scalar cluster is obtained by taking the dot product of two term vectors in the association cluster. SC determines neighbourhood correlation.

-Rashmi Dubey

Friday, September 16, 2011

09/15/2011

Representing the document in a new dimension such that it is a function of the original dimensions is essentially looking at the document from a different angle, so we can capture the document with fewer dimensions.
From the fish example we can see that width increases as length increases (or the other way around), and we see a certain behaviour mapped. A new dimension called size was then created to capture the maximum variation.
-Mohal

--
Mohal Shukla
MCS Computer Science (Information Assurance)
School of Computing, Informatics and Decision Systems Engineering
Arizona State University
Tempe, AZ 

09/15/2011

While considering the example of fish length and width to discuss dimensionality reduction, we saw an ellipsoid-like spread in the graph of size. If this ellipsoid becomes a perfect circle, then the eigen gap will be zero, and hence there is no way to identify a primary attribute that has the least percentage of variation loss.

--Bhaskar Rangaswamy

09/15/2011



Singular value decomposition takes a rectangular matrix of gene expression data (defined as A, where A is an n x p matrix) in which the rows represent the genes and the columns represent the experimental conditions. The SVD theorem states:

A(n x p) = U(n x n) S(n x p) V(p x p)^T

where

U^T U = I(n x n)

V^T V = I(p x p)  (i.e. U and V are orthogonal)


Calculating the SVD consists of finding the eigenvalues and eigenvectors of AA^T and A^T A.


Regards,
Rajshekar


09/15/2011

If the features we considered are really independent of each other, then when we plot them we see them distributed all over the graph and do not see them concentrated in a particular area.


--Avinash 









9/15/2011

The eigen gap |Eval1 - Eval2| (or its exponential, e^|Eval1 - Eval2|) determines how quickly repeated multiplication by the matrix makes a random vector parallel to the principal eigenvector. When the gap is large it takes fewer matrix multiplications, and when it is small it takes much longer, but the vector eventually becomes parallel to the principal eigenvector.

Shubhendra

Thursday, September 15, 2011

9/13/2011

Auto correction/suggestion
Discovered the use of k-gram bag intersection with both the lexicon base and/or query log to either provide auto-correction or enhance user search patterns. One striking thing is that if the IDF of a word in lexicon base is less than a threshold, it could be a typo error and hence we should probably not consider when intersecting! Also an interesting aspect on the use of edit distances, where transposition and alignment could reduce distances considerably. And it was refreshing to look at Bayes rule from a different perspective in correcting query errors which are not syntactical.
--
Aneeth

9/15/2011

Adding 2 documents together to form a third document in the t-d matrix does not change the dimensionality of the matrix because the new document is just a linear combination of the other two.
-James Cotter

9/15/2011

Collaborative filtering is the process of filtering or extracting information or patterns from a dataset using the preferences of many users.

--Dinu John

9/15/11

In PCA (principal component analysis), we are trying to reduce as much
as possible the number of dimensions while still retaining as much
information as possible. In other words, we are finding the axes that
show as much spread of the data as possible. The details have not yet
been discussed, but it turns out that the eigenvectors determine the
dimensions, and the corresponding eigenvalues determine the importance
of the dimension.

Andree

9/15/11

When analyzing a document corpus for correlations, a
global thesaurus construction can lose or
obfuscate a data relationship that would otherwise
be found in a local, query-specific thesaurus.

e.g. "operating" in a c.s. query would associate with the
     word "system";
     "operating" in a global correlation is likely to produce "room".

M.
.

9/15/11

Under scalar clusters, if term k1 is correlated with k2 and k2 is correlated with k3, then k1 is correlated with k3. This is computed by considering terms as vectors: by taking the dot product of the term vectors in the association cluster, we get the scalar cluster.

Preethi

09/15/2011

Why SVD? 

We are assuming a new document corpus considering each term/keyword as independent. Our aim is to reduce the number of dimensions of such a corpus. 

What SVD does? 

Given a matrix M, find a matrix M` which has fewer dimensions than M (rank of M` < rank of M).

Rank of a matrix = size of the largest sub-matrix which has a non-zero determinant. (The rank of M can be considered its "true" dimensionality.)

--Shreejay

09/15/2011

There are 3 ways of doing correlation analysis:
1) Analysis of the entire document corpus (use the doc-term matrix) - creates a global thesaurus
2) Analysis of the top-k documents which are similar to the current query (vector similarity of query and doc is high) - creates a local thesaurus
3) Analysis of query logs - where instead of a doc-term matrix, we consider a query-term matrix; this type of analysis runs into the cold start problem!


--Shreejay

09/15/2011

Keywords/terms in a document can only approximately be considered independent (each keyword/term having a unique meaning, a non-redundant language); to obtain truly independent dimensions, dimensionality reduction techniques are required. PCA (Principal Components Analysis) is a technique for such dimensionality reduction. PCA applied to documents is called Latent Semantic Indexing.




--Shreejay

9/15/2011

Fitting an ellipsoid around the data spread (i.e. as plotted) is eigen decomposition.
Eigenvalues give the importance of a dimension and eigenvectors give the true dimensions. If an axis is discarded, we lose variance.
Eigen gap = |Eval1 - Eval2|

Srividya 


09/15/11

Singular Value Decomposition: If we want a matrix of a particular rank, we use SVD. i.e we reduce the dimensionality by keeping the matrix as close as possible to the original.
Dimensionality Reduction:Transformation of data in high dimensional space to a space of fewer dimensions.

Archana

09/15/2011

Recall is monotonically non-decreasing, whereas precision is not monotone.

-Sekhar

9/15/2011

Using the terms within documents to analyze them may cause problems
with synonymous words or situational words. Computing a Singular Value
Decomposition can describe documents in terms of brand new concepts
while capturing synonymous and polysemic information.
Kalin Jonas

9/15/2011

Scalar clusters are computed by taking the dot product of the association cluster vectors.

- Elias

09/13/2011

Autocorrection is implemented in some cases through k-grams, which are
mapped to the words the user most likely wants. The other way
is based on the edit distance, which is computed through dynamic
programming.

-Ivan Zhou

09/13/2011

Levenshtein edit distance considers all errors to be equally weighted.

Akshay Sumant

09/13/2011

We can multiply TD(Term Document) matrix with its transpose ie DT (Document Term) matrix and that would result in a square TT(Term Term) matrix.
If the resulting matrix is a diagonal matrix, we know that is because each term matches itself.
But if it is not a diagonal matrix then we can add the related terms to the query and enhance our search results.
We can further normalize the resultant square matrix to intuitively see the co-relations between the terms.

-Anwasha Dhar

09/13/2011

Levenshtein distance doesn't allow the transposition operation.

Damerau-Levenshtein distance allows the transposition operation
(e.g. cat / act: transposing "ca" to "ac").

- Vaijan

9/13/2011

Relevance feedback:
Given a query q, a vector d - the average vector of the relevant documents, and d' - the average vector of the irrelevant documents,
relevance feedback can be used to create a new query q' for better results:
q' = q + d - d'

Some variations that can be used:
q' = d - d' (ignore the original query q)
q' = αq + (1-α)(d-d')
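A minimal numpy sketch of this feedback step (the vectors and alpha below are arbitrary examples):

import numpy as np

def rocchio(q, relevant, irrelevant, alpha=0.5):
    # q: query vector; relevant / irrelevant: lists of document vectors
    d_rel = np.mean(relevant, axis=0)          # average relevant document vector (d)
    d_irr = np.mean(irrelevant, axis=0)        # average irrelevant document vector (d')
    return alpha * q + (1 - alpha) * (d_rel - d_irr)

q = np.array([1.0, 0.0, 1.0])
print(rocchio(q, [np.array([1.0, 1.0, 0.0])], [np.array([0.0, 0.0, 1.0])]))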

Kuhu

09/13/2011

The Bayesian approach for spelling correction can also be used for information retrieval, where instead of correcting the spelling of words you correct the wording of queries.

--Shaunak

Wednesday, September 14, 2011

9/13/11

A power law distribution, when applied to IR, can be used to show that
those without query logs can have a harder time gathering them, due to
the inferior ability of a search engine without previous logs to
improve searching. Edit Distance - The distance between two words,
determined by the number of insertions, deletions, replacements, or
transpositions of letters necessary so that the two words are the same.
This process, like many others discussed in class thus far, can be
weighted to give differing values to each of the types of changes, and
even different weights to different variations of each type of change
(such as some letter replacements being more or less heavily weighted
than others). When using this process, proper alignment of
characters is necessary to prevent the edit distance from becoming a
high complexity problem. Correlation and Co-occurrence Analysis -
Terms that are related may be added to the results using a thesaurus-
like method.
-Thomas Hayden

09/13/2011

It is better to use query logs rather than the document corpus to get corrections.
Two terms are related if they have high co-occurrence in the documents.
-Shu

09/13/11

Edit distance is a measurement of how close a possibly misspelled word is to a correctly spelled word.

Levenshtein edit distance is one type of edit distance; it counts how many inserted, deleted, and replaced characters one word is from another.

Damerau-Levenshtein edit distance is similar to the above but also considers characters that might have been typed out of order.

Weighted edit distance is like the above except that possible errors have weights, where one spelling correction would be more likely than another.

Jesse Michael

9/14/2011

Answer to question 1: They are linearly independent, because there is no way that I can multiply one of them by a scalar to obtain the other.  They are not a basis, because no linear combination of those two vectors can produce the vector <1, 1, 1>, which is in the three dimensional Euclidean space.

-Jadiel

9/13/2011

P(k1 k2) = P(k1) * P(k2|k1) if k1, k2 are dependent.

If k1, k2 are independent, P(k2|k1) = P(k2), i.e. k2 does not depend on the occurrence of k1,
hence we have
P(k1 k2) = P(k1) * P(k2)

Anuj

9/13/2011

Weighted edit distance depends on the characters involved, i.e. for example we are more likely to mistype 'm' as 'n' rather than 'm' as 'a'.

Anuj

9/13/2011

Given a word w1, using k-gram distance it is difficult to find the set of words which are one gram away from w1, and hence we use an edit distance metric: the Levenshtein distance.

Anuj.


 

9/13/2011

The power law applies when we analyse the document corpus vs. query logs. The few favorite search engines have the ability to improve further by analyzing query logs (which are not readily available to others).

-Shubhendra

9/13/2011

Levenshtein distance is the minimum number of basic operations to convert a string S1  to S2.
The operations allowed are insert, delete and replace.
 
-Arjun

9/14/2011

Some of the issues with IR that make a perfect system difficult:
  • Representing the document(via bags, inverted index etc) loses some information
  • Words may have different meanings in different contexts
  • Difficulty in expressing queries that match the documents we really want.
Several techniques can be used to help resolve some of these:
  • Relevance feedback
  • Correlation-query elaboration via thesaurus etc..
  • Principal components analysis

-James Cotter

9/13/2011 class tweet


Edit distance between any two strings m and n is the minimum number of basic operations to convert m to n.

The operations are insert, delete, replace and transpose.

Levenshtein distance: the distance between any two strings is the minimum number of basic operations, where the basic operations are insert, delete and replace.

Levenshtein distance between parrot and parrat is 1

Regards,
Rajasekhar

Tuesday, September 13, 2011

09/13/2011

When we find edit /weighted edit/ Levenshtein distance between sequences, alignment of the two sequences with respect to each other plays an important part. Dynamic programming is used to find the optimal alignment. Sequence alignment problem is also seen in Gene sequence encoding and speech recognition.
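A short dynamic-programming sketch of the (unweighted) Levenshtein distance described in these notes:

def levenshtein(s1, s2):
    # dp[i][j] = edit distance between s1[:i] and s2[:j]
    dp = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i in range(len(s1) + 1):
        dp[i][0] = i                                    # delete all of s1[:i]
    for j in range(len(s2) + 1):
        dp[0][j] = j                                    # insert all of s2[:j]
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1   # replace (or match)
            dp[i][j] = min(dp[i - 1][j] + 1,            # delete
                           dp[i][j - 1] + 1,            # insert
                           dp[i - 1][j - 1] + cost)
    return dp[-1][-1]

print(levenshtein("parrot", "parrat"))   # 1
print(levenshtein("cat", "act"))         # 2 (no transposition operation here)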

-Rashmi Dubey

9/13/2011

Matrix multiplication can be understood as vector dot product i.e, between row and column vector.

-- Dinu John

9/13/2011

A similarity metric has three defining properties:
  • Its values are non-negative.
  • It is symmetric.
  • It satisfies the triangle inequality: |AC| ≤ |AB| + |BC|
Most similarity measures don't satisfy the triangle inequality.

-- Dinu John

09/13/2011

Cold Start Problem: There needs to be enough users already in the system to find a match.


--
Nikhil

09/13/11

--> LSI (Latent Semantic Indexing) co-occurrence analysis is helpful in avoiding the problems of synonymy and polysemy.
--> We use a k-gram index to retrieve vocabulary terms that have many k-grams in common with the query.
--> Levenshtein distance: the minimum number of basic operations to convert S1 to S2.

Aamir


9/13/2011

Correlation: Ability to predict the other word in the presence of one word.

Archana

9/13/2011

The Damerau-Levenshtein distance between two queries is the minimum
number of basic insert, delete, replace, and swap operations to
convert one query to the other.
Kalin Jonas

09/13/2011

Vector Multiplication is a series of dot products.

-Abhishek

09/13/2011

Query Logs are more indicative of what people are looking for and hence its analysis can throw up helpful results.

In case you want to create your own search engine, you won't have any query logs in the beginning.

This problem is called "Cold Start" problem.

- Abhishek


9/13/11

Fallacy of priors - assuming all words are equal!

This is why we have to have better models that
incorporate relevance feedback or correlation
analysis.

Think about it
M.

09/13/2011

If all the terms are independent then T-T similarity matrix should be diagonal.


--Avinash 





9/13/2011

K gram distance is not generative.
 
Edit distance between strings S1 and S2, is the minimum number of basic operations to convert S1 to S2.
 
Levenshtein distance - includes insert, delete and replace
Damerau-Levenshtein - includes transposition also.
 
Ramya

09/13/2011

Documents vs. Query Logs
Documents are generally the things available out there, whereas query logs are the things which people are actually looking for.


Abilash

09/13/2011

Based on the % of documents in which two words k1 and k2 occur, they can be positively correlated, negatively correlated, or not correlated.

--Shreejay

09/13/2011

Edit distance is the minimum number of basic operations (insert, delete, replace) needed to convert string s1 to string s2.

-Bharath

09/13/2011

Levenshtein distance - the edit distance between two words using the operations insert / delete / replace, e.g. dog -> do is 1 (one delete operation).

Damerau-Levenshtein - same as above but also includes transposition. Hence cat -> act has Levenshtein distance 2, but Damerau-Levenshtein distance only 1.

 

 

Best Regards,
Shreejay Nair

 

09/13/2011

Relevance feedback is a mechanism by which a user interacts with a text retrieval system to modify the user's original query into a new and better query:

q' = q + C1 * (sum of retrieved relevant document vectors) - C2 * (sum of retrieved irrelevant document vectors)

where q: initial query
          q': new modified query
          C1 = 1/N1, where N1 is the cardinality of Retrieved Relevant
          C2 = 1/N2, where N2 is the cardinality of Retrieved Irrelevant

-Sekhar

9/13/2011

Vector space ranking can be improved by 3 techniques:

1.Correlation Analysis
2.Relevance feedback
3.Principal components analysis.


- Sandeep Gautham

9/13/2011

Computing Jaccard similarity: for the intersection operation, for each distinct word, take the minimum number of times it appears in any of the documents. For the union operation, take the maximum number of times that the word appears in any of the documents.
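A small sketch of that computation using bags of words (the two example documents are made up):

from collections import Counter

def jaccard_bag(words1, words2):
    a, b = Counter(words1), Counter(words2)
    inter = sum(min(a[w], b[w]) for w in a.keys() & b.keys())   # per-word minimum counts
    union = sum(max(a[w], b[w]) for w in a.keys() | b.keys())   # per-word maximum counts
    return inter / union if union else 0.0

print(jaccard_bag("the cat sat on the mat".split(),
                  "the cat ate the rat".split()))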

-Jadiel

9/8/2011

The for-loop presented in the pseudo-code used to compute the cosine similarity between two documents (using tf-idf) is over words, which is the faster way. If the outer level of the for-loop were over documents, it would be slower.

-Jadiel

09/08/2011

The most critical time for a search engine is the time taken "when the user hits enter with their query and before the page gets shown". This is what makes inverted indexing important: we might spend a lot of time creating it, but eventually it cuts down the retrieval time when the user issues the query.

Abilash

09/08/2011

As we know, IDF is a global feature; it can be incorporated into the inverted index, which saves computation time during search.
All words are not equally important; e.g. the word 'Computer' in a repository of computer documents has no discriminating role to play.
Therefore it is useless to index words like 'the', which have an IDF of zero.
One point to ponder is that there is a difference between words with IDF 0 as opposed to 0.0000001. This changes the ranking of the pages after the initial 3000 search results.
 
Aneeth

09/08/2011

Retrieval (a small sketch follows below)
  1. Find the weights for each term (tf-idf) in every document. Considering each term as a dimension in vector space, the weights form a vector for every document.
  2. Then, using the dot product, compute (d.q) / (|d||q|). We can save some divisions by not considering |q|, as we are just going to rank by the vector similarity.
  3. As the term vectors are sparse, we can be query-centric and form an inverted index mapping each word to the document IDs and the number of occurrences of the word in each document.
  4. We can further improve the inverted index by including the positions of the word in that particular document (occurrence index).
  5. This avoids looking at documents where the weight is 0.
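A compact sketch of steps 1-3 (using plain term counts instead of full tf-idf weights, and toy documents; not the exact data structures from class):

from collections import defaultdict, Counter

docs = {1: "information retrieval on the web",
        2: "web search engines index the web",
        3: "vector space ranking"}

# Inverted index: term -> {doc id: number of occurrences of the term in that doc}
index = defaultdict(dict)
for doc_id, text in docs.items():
    for term, count in Counter(text.split()).items():
        index[term][doc_id] = count

def retrieve(query):
    scores = Counter()
    for term in query.split():                     # only touch docs that contain a query term
        for doc_id, count in index.get(term, {}).items():
            scores[doc_id] += count                # stand-in for the tf-idf dot product
    return scores.most_common()                    # ranked by decreasing similarity

print(retrieve("web retrieval"))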
 
Aneeth

09/08/2011

Because the vectors are sparse, inverted indexing becomes very useful.
Naive retrieval touches every document, so it is inefficient. With an inverted index, documents whose similarity would be zero are not touched at all. The important point is to use the inverted index to compute the similarity metric.
In traditional IR, the index is made of keywords, and the lexicon is made of keywords. Modern search engines index the full text.
If you construct the index yourself, you should pay attention to stemming and stop words, which can not only reduce the index size but also improve answer relevance.

--  Shu Wang

9/8/2011

If the corpus of the documents is constant we can pre-compute the idf for each term and store it in the vector space similarity matrix.

Anuj

9/8/2011

If we stop the calculation of the similarity matrix midway:

1) For the inverted index: it reflects something about all the documents.
2) For naive retrieval: it reflects everything about some of the documents.

Anuj

Sunday, September 11, 2011

09/08/2011

1. Stemming and stop-word elimination were done in traditional IR systems, and are not used as much now.
2. An inverted index is a data structure which maps terms to documents, i.e. it says which documents contain which term.
3. The set of all terms in the index is referred to as the lexicon or vocabulary.
4. Position information of a term in a document is useful for proximity queries and also while displaying the snippet.

-Bharath

9/6/2011

- There is an assumption that the words being considered as different dimensions of the vectors are independent of each other. However, there are techniques like Latent Semantic Indexing that can be used to see the correlation between terms.
- Since all the weights in the vectors are either positive or zero, we will not have more than 90 degrees between vectors.

Ganesh

9/8/2011

Jaccard similarity is used to find words in the lexicon that most resemble the query words whereas vector similarity is used to find the documents that should appear in the search results by comparing the document vectors with the query vector.

Ganesh

Saturday, September 10, 2011

09/08/2011

As these huge vectors are sparse in nature, we exploit this sparsity by using an inverted index. Any document whose similarity is equal to zero is not touched by the inverted index. The inverted index does not change the ranking of the pages. Rank the documents in order of decreasing vector similarity. So using an inverted index has the advantage that the words of the query fall in only a very small fraction of the set of documents.

Archana

Friday, September 9, 2011

09/08/11

--> Putting a threshold on the IDF value makes indexing simpler.
--> The lexicon is the entire corpus of terms.
--> Forward indexing is not efficient, as every document needs to be parsed.

-Aamir

Re: 9/8/2011

Traditional IR indexed only keywords, whereas modern engines index the full text, even stop words!

Shubhendra

9/8/2011

Google also indexes stop words and even displays result for "the" (http://www.google.com/search?q=the).

09/08/2011

Documents are divided into barrels (ordered so that the first holds a low percentage of the documents and the last a higher percentage), and results are shown to the user from the first barrels initially; if the user persists, results from the later barrels are shown.

--Avinash 

09/08/2011



Inverted Index: An index into a set of texts of the words in the texts. The index is accessed by some search method. Each index entry gives the word and a list of texts, possibly with locations within the text, where the word occurs.

-Regards
Rajasekhar


Thursday, September 8, 2011

09/08/2011

In the inverted indexing technique, we don't deal with dictionaries but with the lexicon generated from the words in the document corpus. Having such a lexicon saves time during query processing. If the document corpus is not changing, a previously created lexicon can be used.

-Rashmi Dubey

09/08/2011

The document-term matrix is a d x t matrix where:
d - documents, represented as row vectors.
t - terms or words, represented as column vectors.

Since not all words are present in every document, this matrix is mostly sparse.
The cosine similarity of each document and the given query is calculated. The disadvantage of this approach lies in the fact that many documents are not at all similar to the query, that is, they don't contain the relevant terms. In such cases the cos(theta) value is zero for most documents, but it is still calculated.

This issue can be resolved by using inverted indexing, which selects documents based on the query.

-Rashmi.

9/8/2011

Stemming is the process of reducing inflected words to their base or root form. The Porter stemming algorithm, developed by Dr. Martin F. Porter, is widely used and is the de-facto standard algorithm for English stemming.

-- Dinu

8/9/2011

In bioinformatics, inverted indexes are very important in the sequence assembly of short fragments of sequenced DNA. One way to find out where fragments came from is to search for it against a reference DNA sequence.

Venkat

9/8/2011

I was randomly browsing about ranking algorithms and came across this page which explains the difference between popularity and the trustworthiness of a webpage. Google or Bing may follow different page ranking algorithm to rank the web pages. Personalization would also play an important role in determining the ranking of the webpage.


The popularity of the web page is calculated in terms of mozRank


MozRank is calculated at a scale of between 1 and 10 with respect to the popularity of the website. The underlying principle is very easy to understand. The more links a page receives, the better it will be popular and obviously it will increase the ranking of the site.


In the similar line, trustworthiness of a webpage is calculated in terms of mozTrust


mozTrust is another important factor which would influence the ranking of a page in the search engine results. This feature enumerates the confidence of a page relative to the other page found on the web. Again, the logic is quite similar. If a page gets a link from another page which was considered "trusted" by the engine, it is rated on a higher scale. The trusted sites can be any of the educational institutions like ASU, government websites, company sites etc.


For getting more information and technical aspect of these features, please refer this site.


http://www.seomoz.org/learn-seo/mozrank


http://www.seomoz.org/learn-seo/moztrust



--
Sathish

9/8/11

Modern search engines index using not only words but
phrases like "computer science".

In noun phrase detection, "computer science" is indexed as
only one word.
 
M.

9/8/2011

Inverted indexes are usually stored in the form of a hash table or a binary tree.

9/8/2011

For each term, we maintain the number of documents that contain that word and a pointer to the location which stores the document IDs and the positions of the word within each document.

Srividya

9/8/11

Capturing position of a word in a document is important for performing proximity searches.
 
-Arjun

9/8/2011

Approximate ranking works (reasonably well) in most cases because most users want higher precision at lower recalls.

--Shaunak

9/8/2011

When computing the similarity between a query and a document, one should be query-centric instead of document-centric in order to take advantage of the sparsity (0 weight) of terms in a document.

Analogy: Index at the end of a textbook.
Solution: Inverted Index

--Shaunak

9/8/2011

Three techniques for generating keywords (a toy sketch follows below):
1. Stop word elimination - eliminate common words from the lexicon (e.g. do not index them). In English, some examples would be "the", "an"...
2. Noun phrase detection - combine multiple words that occur together, e.g. "data structure".
3. Stemming - remove the endings of words so that the query can be matched more easily to the words indexed from documents, e.g. "walked" -> "walk".
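A toy sketch of techniques 1 and 3 (the stop-word list and suffix rules below are made up and far cruder than a real stemmer such as Porter's):

STOP_WORDS = {"the", "an", "a", "of", "to", "and"}

def naive_stem(word):
    for suffix in ("ing", "ed", "s"):            # very crude stand-in for a real stemming algorithm
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def keywords(text):
    return [naive_stem(w) for w in text.lower().split() if w not in STOP_WORDS]

print(keywords("The students walked to the data structures lecture"))
# ['student', 'walk', 'data', 'structure', 'lecture']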

-James Cotter

9/8/2011

The dimension of Doc-Term Matrix is very big but they are sparse.
(where weight = 0 for many terms)

-Sandeep Gautham

9/8/2011

Traditional IR involves 'Keyword' indexing. Lexicon is made of 'keywords' only.

Manual Keyword generation includes:
1. Stemming
2. Stop word elimination

Advantages of Stop word elimination:
- reduces the index size
- improves the term relevance

Ramya

09/08/2011

Inverted  Index is query centric and the similarities of all documents are computed simultaneously.

-Sekhar

9/8/11

Lexicon is a container for words. Inverted index is a data structure storing a mapping from content to document or set of documents.

- Elias