Thursday, September 1, 2011

8/30/2011

DB VS IR
 
We pose a complete, precise query to a database and the result returned is exact; in IR, by contrast, we pose an incomplete query (keywords) and the user has to determine the relevance of the results.
The reason is that the corpus an IR system searches is unstructured, so retrieving a precise result would require a complete understanding of natural language. Since this is not achieved, the queries posed by users are ill-posed.
In a traditional DB, each record has a probability of either 0 or 1 (valid/invalid) for the query posed, whereas the results of IR are ranked by a probability of relevance that can vary between 0 and 1.
This raises the question of how relevance is calculated. The relevance of a document is not decided by the IR system alone; the system learns from user feedback to improve its estimate of the probability of relevance.
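The 0/1-versus-graded contrast can be sketched in a few lines of Python. The documents and the term-overlap scoring below are made up purely for illustration:

```python
# Toy contrast between DB-style boolean matching and IR-style ranked
# retrieval. Documents and the scoring scheme are invented for this sketch.

docs = {
    "d1": "information retrieval ranks documents by relevance",
    "d2": "databases return exact answers to precise queries",
    "d3": "users judge the relevance of retrieval results",
}

def db_match(doc, query_terms):
    """DB-style: a record either satisfies the query (1) or not (0)."""
    words = set(doc.split())
    return 1 if all(t in words for t in query_terms) else 0

def ir_score(doc, query_terms):
    """IR-style: a graded score in [0, 1] (here, fraction of query terms present)."""
    words = set(doc.split())
    return sum(t in words for t in query_terms) / len(query_terms)

query = ["relevance", "retrieval"]
print({d: db_match(text, query) for d, text in docs.items()})

# Ranked list, highest graded score first.
ranking = sorted(docs, key=lambda d: ir_score(docs[d], query), reverse=True)
print(ranking)
```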

Aneeth Anand

08/30/2011

The "relevance" of an IR system mainly depends on the following factors:

1. the number of documents needed and matched,
2. the query representation and term proximity,
3. the feedback between the user and the system,
4. the combination with other documents already shown.
Shu Wang
 
--
Best wishes,
 
Shu Wang
 
Computer Science Department
Arizona State University

08/30/2011

One way to evaluate an IR algorithm is TREC (the Text REtrieval Conference), which provides document collections, queries, and relevance judgments, so that a system's ranking quality can be judged by its precision and recall over a common corpus of queries.
 
Shu Wang


8/30/2011

Generally, the performance of an IR system can be judged by two factors, precision and recall, which play different roles in different cases. Most of the time, when the user hopes to see the most relevant results ranked high in a short list, he/she wants the system to be more precise. At other times, when the user wishes to find as much relevant information as possible, even documents only loosely related to the keywords, he/she needs recall.

 
But in designing an IR system, we had better strike a trade-off between the two.
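One way to see the trade-off: as the system returns more results, recall can only rise while precision tends to fall. A toy sketch (the ranking and the relevance labels below are invented):

```python
# Precision/recall trade-off as the result-list cutoff k grows.
# The ranked list and ground-truth labels are made up for illustration.

ranked = ["d1", "d2", "d3", "d4", "d5", "d6"]   # system's ranking
relevant = {"d1", "d3", "d6"}                    # ground-truth relevant docs

def precision_recall_at_k(ranked, relevant, k):
    returned = set(ranked[:k])
    tp = len(returned & relevant)
    return tp / k, tp / len(relevant)

for k in (1, 3, 6):
    p, r = precision_recall_at_k(ranked, relevant, k)
    print(f"k={k}: precision={p:.2f} recall={r:.2f}")
```

Returning everything (k = 6) drives recall to 1.0 but drags precision down, which is the tension a designer has to balance.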
 
 
Shu Wang


8/30/2011

1. Since relevance is an ill-specified problem, we try to assign a relevance number R(.) to each document. The relevance number is written:
R(d | Q, U, {d1...dk})
where
d - the document for which we are trying to find the relevance number,
Q - the query specified by the user,
U - the user himself, since, as already established, in IR systems it is the user who assigns relevance to a document (or determines which document is more relevant to him),
{d1...dk} - the documents already shown.
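A hypothetical sketch of how the arguments of R(d | Q, U, {d1...dk}) might enter a score: query match rewarded, similarity to already-shown documents penalized (in the spirit of maximal marginal relevance). The user term U is omitted here, and the scoring functions and 0.7/0.3 weights are made-up placeholders, not a real system's formula:

```python
# Sketch of R(d | Q, {d1..dk}): query match minus a redundancy penalty
# for documents already shown. All numbers and functions are illustrative.

def term_overlap(a, b):
    """Jaccard overlap between two whitespace-tokenized texts."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def relevance(d, query, shown):
    match = term_overlap(d, query)
    # Penalize similarity to the most similar already-shown document.
    redundancy = max((term_overlap(d, s) for s in shown), default=0.0)
    return 0.7 * match - 0.3 * redundancy

doc = "ranked retrieval of relevant documents"
print(relevance(doc, "relevant documents", shown=[]))
print(relevance(doc, "relevant documents", shown=[doc]))  # lower: already shown
```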

-Rashmi Dubey

8/30/2011

1. Precision and recall are good measures to evaluate an IR engine. In terms of set theory, this is expressed as:

U (universe) = tn + fp + tp + fn
Relevant docs the user is looking for = tp + fn
Docs the system returns = tp + fp

where,
fp - false positives
tp - true positives
fn - false negatives
tn - true negatives

For a perfect IR system, fn = fp = 0, so that (relevant docs the user is looking for) = (docs the system returns).
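The set-theoretic quantities above translate directly into the usual precision and recall formulas; the returned/relevant sets here are invented examples:

```python
# Precision and recall computed from the tp/fp/fn counts defined above.
# The two sets are made up for illustration.

returned = {"d1", "d2", "d3", "d4"}   # docs the system returns
relevant = {"d1", "d3", "d5"}         # docs the user is looking for

tp = len(returned & relevant)   # true positives: 2
fp = len(returned - relevant)   # false positives: 2
fn = len(relevant - returned)   # false negatives: 1

precision = tp / (tp + fp)   # fraction of returned docs that are relevant
recall = tp / (tp + fn)      # fraction of relevant docs that were returned
print(precision, recall)
```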

2. Low precision can easily be caught by the user, since irrelevant results appear directly in the returned list.
3. In precision-recall curve, precision starts at 1.0 and usually reduces as recall increases.
4. The F-measure or F-score is the harmonic mean of precision and recall. The harmonic mean (HM) is used because it is closer to the minimum of the terms being averaged.
5. When precision (P) and recall (R) are weighted equally, the F-measure is called the F1-measure and is given by the formula:
     F1 = 2(P)(R)/(P+R)
Recall can be weighted relative to precision by a positive real number B (read: beta), so that recall counts B times as much as precision. The F-measure in such cases is given by:
    FB = (B*B + 1)(P)(R)/(B*B*P + R)
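The two formulas above can be computed with one function, since F1 is just FB with B = 1; the precision/recall values below are arbitrary examples:

```python
# F-measure from the formula FB = (B^2 + 1)PR / (B^2 P + R).
# The p, r values are arbitrary example numbers.

def f_beta(p, r, beta=1.0):
    """F-measure weighting recall beta times as much as precision."""
    return (beta**2 + 1) * p * r / (beta**2 * p + r)

p, r = 0.5, 0.8
print(f_beta(p, r))          # F1: equal weighting
print(f_beta(p, r, beta=2))  # F2: recall-heavy
```

With recall (0.8) higher than precision (0.5), the recall-heavy F2 comes out larger than F1, as the weighting suggests.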

-Rashmi Dubey

8/30/2011


1. Users play no role in traditional database queries. The results returned by the DB are 100% correct with respect to the query.
2. IR query challenges:
    (a) the IR system doesn't understand natural language,
    (b) the user is aware of this fact,
    (c) the user is often not sure what s/he is looking for.
3. Due to the above, IR queries are usually imprecise, under-constrained, and ill-posed. True relevance to an IR query is determined by the user.

-Rashmi Dubey