Monday, December 5, 2011

09/27/2011

Hubs point to authorities. A hub's value is the sum of the authority scores of the pages it points to,
and an authority's value is the sum of the hub scores of the pages that point to it.

In PageRank and in authority/hub computation, power iteration is the fastest way of calculating the principal eigenvector,
and the rate of convergence depends on the eigengap.

--
Nikhil Pratap
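
To make the two notes above concrete, here is a minimal power-iteration sketch for hub/authority scores; the tiny 3-page link graph and the fixed iteration count are invented purely for illustration.

```python
# Minimal power-iteration sketch for HITS hub/authority scores.
# The 3-page link graph below is made up for the example.
import numpy as np

A = np.array([[0, 1, 1],   # A[i][j] = 1 if page i links to page j
              [0, 0, 1],
              [1, 0, 0]], dtype=float)

hubs = np.ones(3)
auths = np.ones(3)
for _ in range(50):                      # iterate until (approximately) converged
    auths = A.T @ hubs                   # authority = sum of hub scores pointing at it
    hubs = A @ auths                     # hub = sum of authority scores it points to
    auths /= np.linalg.norm(auths)       # normalize so the values stay bounded
    hubs /= np.linalg.norm(hubs)

print("hub scores:      ", hubs.round(3))
print("authority scores:", auths.round(3))
```

The normalized vectors converge to the principal eigenvectors of A·Aᵀ and Aᵀ·A; the larger the eigengap, the fewer iterations are needed.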

12/1/2011

The challenge with data integration is that the data is spread over different sources with different schemas, and a lot of time is consumed cleaning and migrating the data and integrating it into one unit.

-- Dinu John


12/1/2011

Information integration is the process of combining data from different sources and giving one view of the entire data in a federated manner.

-- Dinu John

11/29/2011

An extraction pattern can be specified as a path from the root of the DOM tree using XQuery, or we can write a regular expression to extract a particular portion of the web page.

-- Dinu John
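
A toy sketch of the regular-expression side of this note; the HTML snippet and the pattern are invented for illustration (a real wrapper would more often walk the DOM tree with path expressions).

```python
# Toy screen-scraping sketch: pull (title, price) pairs out of a
# template-generated page with a regular expression.
# The HTML and the pattern are invented for illustration only.
import re

html = """
<div class="item"><span class="title">Blue Widget</span><span class="price">$9.99</span></div>
<div class="item"><span class="title">Red Widget</span><span class="price">$12.50</span></div>
"""

pattern = re.compile(
    r'<span class="title">(?P<title>[^<]+)</span>'
    r'<span class="price">\$(?P<price>[\d.]+)</span>'
)

for m in pattern.finditer(html):
    print(m.group("title"), float(m.group("price")))
```

As the other 11/29 notes point out, this kind of wrapper breaks as soon as the page template changes.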


11/29/2011

The process of extracting data from web pages is referred to as screen scraping.

-- Dinu John

11/29/2011

Extractors for a semi-structured web site are referred to as wrappers. These wrappers are custom-built and have to be modified if the structure of the web page changes.

-- Dinu John


11/29/2011

The HTML structure of the pages belonging to a web site is specific and regular: they all follow a specific pattern and are generated from a template.

-- Dinu John


10/27/2011

In parametric learning, the size of the learned representation is fixed in advance and is not proportional to the training data.
E.g.: linear classifiers.

Non-parametric learners keep the entire training data around.
E.g.: k-NN.

The problem with k-NN is that it doesn't do well in high dimensions.


--
Nikhil Pratap

10/25/2011

Relevance feedback rewrites the query so that the rewritten query
has higher precision and recall.

--
Nikhil Pratap
Graduate student
Department of Computer Science
(Arizona State University)

Sunday, December 4, 2011

11/03/2011

In content-based recommending, recommendations are based on information
about the content of items rather than on other users' opinions.

Advantages:

We can recommend to users with unique tastes.

We will be able to recommend new and unpopular items.


--
Nikhil Pratap

11/1/2011


Collaborative filtering follows these steps:

1. Weight all users with respect to their similarity with the active user.

2. Select a subset of the users (neighbors) to use as predictors.

3. Normalize the ratings and compute a prediction from a weighted combination of the selected neighbors' ratings.

4. Present the items with the highest predicted ratings as recommendations.


--
Nikhil Pratap
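
A compact sketch of those four steps on a made-up ratings matrix; cosine similarity over co-rated items stands in for whatever weighting scheme is used, and all numbers are invented.

```python
# User-based collaborative filtering sketch: weight users by similarity to
# the active user, keep the nearest neighbors that rated the target item,
# and predict the missing rating from a normalized, weighted combination
# of their ratings.  0 means "not rated".
import numpy as np

ratings = np.array([[5, 3, 0, 1],
                    [4, 0, 0, 1],
                    [1, 1, 0, 5],
                    [1, 0, 4, 4]], dtype=float)
active, target_item, k = 0, 2, 2          # predict item 2 for user 0

def similarity(u, v):
    mask = (u > 0) & (v > 0)              # compare only co-rated items
    if not mask.any():
        return 0.0
    return float(np.dot(u[mask], v[mask]) /
                 (np.linalg.norm(u[mask]) * np.linalg.norm(v[mask]) + 1e-9))

candidates = [(similarity(ratings[active], ratings[u]), u)
              for u in range(len(ratings))
              if u != active and ratings[u, target_item] > 0]
neighbors = sorted(candidates, reverse=True)[:k]

mean_active = ratings[active][ratings[active] > 0].mean()
num = sum(s * (ratings[u, target_item] - ratings[u][ratings[u] > 0].mean())
          for s, u in neighbors)
den = sum(abs(s) for s, u in neighbors) + 1e-9
print("predicted rating:", round(mean_active + num / den, 2))
```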


11/1/11

Words selected in feature selection should be the most highly correlated with the class AND the least correlated with each other.


--
Mohal 

10/27/2011

In non-parametric learners, we have to look at all of the training data, whereas in a parametric learner we just have to remember a learned representation whose size is independent of the training data size.


--
Mohal

10/27/2011

When the training data is small, the prior should have an effect, but after considering a large amount of training data, sticking with the prior is irrational.


--
Mohal 

10/25/2011

Finding the K nearest neighbors is the same as retrieving the K closest documents.


--
Mohal

10/25/2011

Since classes may not form sphere-like clusters, distance-based methods do not prove to be as useful.


--
Mohal

10/20/2011

When using vector similarity as the distance measure, the average pairwise similarity between two clusters is the cosine similarity of their centroids.

--
Mohal

10/20/2011

High-dimensional vectors are likely to be orthogonal, in which case their similarity is 0. To avoid this, clustering should be done AFTER performing LSI and reducing the dimensions.


--
Mohal

10/14/2011

K-means tries to find spherical clusters even if the original data is not spread in that fashion.

--
Mohal

10/14/2011

Increasing the purity of clusters by increasing the number of clusters does not work, as one can trivially put each element in its own cluster.

--
Mohal 

12/1/2011

Information integration is combining information gathered from multiple places that might have different schemas. Sabre is a database used by all major airlines except Southwest where flight information is stored, so other websites such as Kayak and Expedia can retrieve and display to consumers information about multiple airlines at the same time.


--
Ivan Zhou


11/22/2011


An extraction model can do things such as segmentation,
classification, clustering, and association.

--
Nikhil Pratap

12/1/11

Various models of II:
Search (minimal collation/integration)
Mediator (acts like a broker)
Warehouse (holds lots of data but inadvisable
when data is rapidly changing)

M.

11/29/2011

Maximum likelihood (learning), the most likely state sequence (decoding), and observation likelihood are three important aspects of the HMM model in Information Extraction.

-------Abilash

11/29/2011

PMI is used for assessing the accuracy of extracted facts based on search hit counts.
 
--

Avinash Bhashyam

11/22/2011

An interesting aspect to ponder in Information Extraction is that web pages are not random but are in fact written by humans, thereby making it possible to understand their structure and content.

--------Abilash

11/22/2011

Information extraction from the web doesn't require full NLP because there is significant regularity on the web. So some structure can be found in template-driven pages, and wrappers can extract information using that structure.

--

Avinash Bhashyam

11/29/2011

In supervised learning the links are explicitly given; the probabilities are smoothed and thus all links have a non-zero probability of occurrence.

--
Srijith Ravikumar

11/08/2011

The competitive ratio is the ratio of the performance of an online algorithm to that of the optimal offline algorithm.

--Shaunak Shah

11/17/2011

The RDF standard can be used to write base facts in XML.

--Shaunak Shah

Saturday, December 3, 2011

11/17/2011

Schema mapping, used in integration, specifies relations between the attributes of the schemas being integrated; this can be written in OWL.

Anuj
 

11/17/2011

Boolean satisfiability is NP-complete because, given a set of boolean constraints, it is hard to find an assignment of the variables which satisfies these conditions.
In general, 1-SAT is easy to solve, 2-SAT is polynomial, and 3-SAT is NP-complete.

Anuj

Friday, December 2, 2011

12/1/2011

A major issue with data integration is that many different people make different database schemas to store the same data (i.e., across the web, businesses, etc.). Sometimes the schemas map one-to-one to each other, but more often they map in a non-trivial manner and may not even store all the same information.

-James Cotter

11/29/11

When evaluating the most likely state
sequence note:
The most likely label of a single word
may be different from its label in
the most likely sequence.

M.

Wednesday, November 30, 2011

11/29/2011

Wrappers are used for extraction when pages are generated with an identical schema. Along with the generation of a wrapper, its maintenance is also very necessary.

Preethi

11/29/2011

DOM trees are used for writing the patterns of data in web pages which can be extracted. There are a lot of regular web pages from which information can be extracted easily.

Preethi

11/15/2011

XQuery: not every XQuery corresponds to a unique SQL query. It may include queries over metadata. Converting it to SQL queries is a non-trivial conversion;
it may be converted to a single SQL query or to multiple ones.
 
-Rashmi Dubey

11/17/2011

A deductive database is the combination of a database with tables plus some additional background knowledge.
 
-Rashmi Dubey

11/22/2011

Information Extraction (IE) lies in the middle between Information Retrieval & NLP approaches. It aims to extract information from semi-structured data like referenced papers, Wikipedia, etc.

The regularities of web pages enable IE without NLP.
 
-Rashmi Dubey

11/29/2011

In Collective Classification, the neighbours of a node define constraints on it.
 
-Rashmi Dubey

Tuesday, November 29, 2011

11-29-2011

We can assume that a Hidden Markov Model exists behind every possible grammatically correct sentence. If we can develop a rudimentary HMM for a language like English, we can use it to predict the likelihood of a sentence being said. We can take an imperfect speech recognition system and use the sounds the program hears to generate a list of possible sentences that sound similar. We then use the HMM to determine how likely each of these sentences is to be said, relative to the others. This gives us a good idea of what the human actually said, rather than picking one possibility at random, since real sentences are far more likely than nonsense that simply rhymes with the sentence.
~Kalin Jonas

11/29/2011

The three useful HMM tasks are observation likelihood, decoding, and learning.

- Elias
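
A tiny decoding (Viterbi) sketch for the second of those tasks; the two-state model and all of the probabilities are invented for illustration.

```python
# Viterbi decoding sketch: find the most likely hidden state sequence of a
# toy HMM for a two-word observation.  All probabilities are made up.
states = ["Noun", "Verb"]
start = {"Noun": 0.6, "Verb": 0.4}
trans = {"Noun": {"Noun": 0.3, "Verb": 0.7}, "Verb": {"Noun": 0.8, "Verb": 0.2}}
emit = {"Noun": {"dogs": 0.5, "run": 0.1}, "Verb": {"dogs": 0.1, "run": 0.6}}
obs = ["dogs", "run"]

# table[t][s] = (best probability of reaching state s at time t, backpointer)
table = [{s: (start[s] * emit[s][obs[0]], None) for s in states}]
for t in range(1, len(obs)):
    row = {}
    for s in states:
        prob, prev = max((table[t - 1][p][0] * trans[p][s] * emit[s][obs[t]], p)
                         for p in states)
        row[s] = (prob, prev)
    table.append(row)

# trace the backpointers from the best final state
last = max(states, key=lambda s: table[-1][s][0])
path = [last]
for t in range(len(obs) - 1, 0, -1):
    path.append(table[t][path[-1]][1])
print(list(reversed(path)))      # most likely tag sequence for "dogs run"
```

Note, as the 11/29 entry further down says, that the most likely label of a single word can differ from its label in the most likely sequence.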

11-17-2011

Humans writing webpages can do RDF-Triple extraction, or force a program to do it for them if they want. They can further specify an OWL source of axioms to facilitate even more interesting queries.
Kalin Jonas

11-15-2011

Tags carry meaning, but only to humans. To gather meaning from them, a machine must use interrelations between the tags.
Kalin Jonas

11/22/2011

Information Extraction encompasses all of the following four areas: segmentation (separation of words), classification (context), clustering (assignment to groups) and association (relations with other words).

-Ivan Zhou

--
Ivan Zhou
Graduate Student
Graduate Professional Student Association (GPSA) Assembly Member
Eta Kappa Nu (HKN) Active Member 
School of Computing, Informatics and Decision Systems Engineering
Ira A. Fulton School of Engineering
Arizona State University

11/22

Closed sets and regular sets are common pattern types, both of which are significantly easier to parse than complex patterns.  Closed sets have a predefined set of values which would allow one to enter data statically (such as states, countries, or capitals).  Regular sets match regular expressions but are often too numerous to make enumeration worthwhile (such as phone numbers or shipping tracking numbers).  Complex patterns such as postal addresses may be recognizable for humans but difficult to interpret for machines 

-Christophe

11/17/2011

RDF-Schema specifies information about classes in a schema, properties  and relationships between classes. It is a basic set of nodes and relations that can be used to express properties of classes of the schema.

-- Dinu

11/22/2011

Information extraction is between Information Retrieval and NLP.

-- Dinu

11/22/2011


The correctness of an extractor can be evaluated using the F-measure.

Wrapper induction can be used for accurate extraction.

-Rajasekhar

11/22/2011


Information Extraction = segmentation + classification + clustering + association

           -   Initially segment the text
           -   Classify the segmented words into classes
           -   Association to find correlation between words
           -   Now cluster the classified segments into groups

-Rajasekhar

Friday, November 25, 2011

11/15/2011

XML is a language for specifying the structure of a text. It is a syntactic language which does not hold any knowledge of the interrelations between pieces of text.

From the DB point of view, XML gives an option to represent data with less structure, and vice versa from the IR point of view.


--Shreejay

11/22/2011

Most of the Information Extraction (IE) done nowadays uses the structure of the web instead of relying on NLP (most of the data on the web has some form of structure).

IE consists of 4 processes: 1) segmentation 2) classification 3) association 4) clustering.

--Shreejay

Wednesday, November 23, 2011

11/22/11

Information Extraction = Segmentation + Classification + Clustering + Association
    Segmentation: segment the portions of text which are entities of interest.
    Classification: use a classifier to group the segments into classes using training data.
    Association: associate them into groups based on relationships.
    Clustering: cluster them into tuples.

-Aamir

11/22/2011

The web is not random at all. Structure exists, it just isn't well defined.

Figuring out this structure and extracting information from it is
highly useful for relating things such as documents that might
reference each other, but do not actually contain a hyperlink.

Again, we are not trying to fully understand what the documents mean,
we just want to obtain a limited amount of information.

An extraction model can do things such as segmentation,
classification, clustering, and association.

Information extraction falls somewhere between IR and NLP. In a sense
IE is a limited, subject-specific version of NLP.

-Thomas Hayden

Tuesday, November 22, 2011

11/17/2011

OWL is used to write the background knowledge. OWL is a subset of first-order logic and it does not allow defeasible logic.
 
-Bharath

November 22, 2011

Several techniques can be used to extract text from documents. Some of these include sliding windows, finite state machines, and extracting syntax trees.

--James Cotter

November 17, 2011

RDF is a standard for writing base facts in XML syntax.

--James Cotter

11/17/2011

Information Extraction is really a continuum.

Easy extraction is usually rule-based, and calling it NLP is a misnomer.

From structured data that is wrapped into a short
English format for people, we can get the structured data back by
unwrapping. This is not NLP but a pattern-based
picking of words.

--
Nikhil Pratap

11/17

RDF Schema (now called OWL) is actually background knowledge about the data in RDF. This allows for more powerful inferences to be made about the data. This still does not provide full first-order logic, although it does provide some additional functionality.

-Christophe

11/17/11

The basic idea of Semantic Web is for every web page
to have meta data that consists of RDF triples and possibly
links to an OWL database that provides semantics via OWL axioms.
This concept is somewhat a bust due to the lack of use by people
producing web pages.  BUT, semantic web is still useful as an 
inter-operable standard for exchanging structured data.

M.

11/15

XML Schema allows for one to enter any data regardless of whether it has any meaning. XML Schema allows one to check whether the syntax is valid, although the data could still be wrong, or have no meaning at all

-Christophe

11/10

CSS allows an XML document to be rendered for display, while XSL style sheets allow for general translation of one XML tag format to another.

-Christophe

11/8

Competitive ratio compares offline and online results: offline is easier than online and should always have a superior (or equal) result, so the efficiency of an online method can be benchmarked using the offline method as a reference.

-Christophe

11/3

Document/word correlations are strong; if a document does not contain a word then the word is not relevant to it. On the other hand, user/item correlations are weak; if a user has not rated an item, (s)he may just not have purchased it.

-Christophe

11/1

Feature selection is similar to the dimensionality reduction from LSI: it reduces the amount of information that must be processed and may increase the chance of correct predictions.

-Christophe

10/27

One can decrease the effects of training data that is incomplete or misrepresentative (e.g., coin flips with a low number of flips) by using smoothing. This can be done by creating new samples which pull the knowledge base closer to a pre-determined bias (such as a 50-50 ratio on coin flips).

-Christophe
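
A tiny sketch of that virtual-sample (m-estimate) idea with the coin-flip example; the counts, the virtual sample size m, and the prior are all invented.

```python
# m-estimate smoothing sketch: blend the observed ratio with a prior by
# pretending we also saw m "virtual" flips at the prior ratio (50/50 here).
heads, flips = 3, 4          # small, misleading sample: 75% heads observed
m, prior = 10, 0.5           # virtual sample size and prior belief (assumed)

smoothed = (heads + m * prior) / (flips + m)
print(f"raw estimate: {heads / flips:.2f}, smoothed estimate: {smoothed:.2f}")
```

With few real flips the estimate stays near the 50-50 prior; as the number of real flips grows, the data dominates.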

11/17/2011

RDF has XML syntax; hence there is an XML schema for each RDF file.
Also, SPARQL is used to query RDF data.

Thanks
Sandeep Gautham

11/17/2011


Deductive database: a database of tables with some background knowledge.

- RDF is a particular sub-syntax of XML. RDF files also have an XML schema.

- SPARQL is a query language for performing operations on RDF triples.

- OWL is a subset of first-order logic.

- Schema mapping is a relation between attributes.

-Rajasekhar

11/17/2011

RDF Schema defines a schema vocabulary which supports the definition of ontologies, giving extra meaning that specifies how a term should be interpreted.
------Abilash

11/15/2011

RDF is a standard for writing base facts in XML syntax, whereas OWL/RDF Schema are the standards for writing domain knowledge in XML syntax.

-----Abilash

11/10/2011

An XML document is termed well-formed if it has matching tags.
-----Abilash

11/08/2011

Ranking advertisements poses an interesting problem: for example, consider a case where the 1st advertiser posts something offensive and users become reluctant to view the next advertisements, which may adversely affect the search engine and the other advertisers.
-----Abilash

11/17/11

Inference over logical databases (such as that represented by RDF) is analogous to theorem proving.

Andree

11/08/2011

An interesting aspect of web advertising is the fact that it is very difficult to satisfy search engines, advertisers, and users all together. On one hand, the search engine and advertisers look for more clicks. On the other hand, users look for good ads.
-----Abilash

Monday, November 21, 2011

11/15/2011

Without background knowledge, inference is linear in the size of the database. But once we have background knowledge, even though the knowledge base is small, reasoning can become exponential and the query complexity also increases; hence inference on the Semantic Web is not linear in the size of the data.

Anuj

 

11/15/2011

An XML file need not have a machine-accessible meaning. We should have a way of mapping XML schemas, i.e., be able to relate different tags; this is possible by writing the background information you have about a domain in a formal language. OWL/RDF Schema are the standards for writing domain knowledge in XML syntax.

Anuj

 

11/15/2011

XQuery does everything that SQL does for databases; it also does what IR wants it to do (text querying); moreover, it allows us to change the format of the data (rewrite the data in any format).

Anuj
 

11/17/2011

XML does not provide any means of talking about semantics, so we have RDF for writing binary predicates.

--
Avinash Bhashyam

11/15/2011

XQuery is used to query XML documents just as SQL is used to query databases.

--

Avinash Bhashyam

Saturday, November 19, 2011

11/17/2011

RDF describes the data at hand. OWL represents the background knowledge. Learning the background knowledge is tough because it is hard to set the level of abstraction required for the data.
--
Srijith Ravikumar

Friday, November 18, 2011

11/17/2011

RDF is "base facts" and RDFS is "background facts or rules" written using XML

--
Sathish

11/17/2011

There are some instances in which we need to query multiple databases
to retrieve data. These databases might have different schemas and a
way to retrieve the data could be writing two queries, but the better
way would be to use a schema mapping (integrator) and use one query.
This is where OWL comes into play.

-Ivan Zhou

Thursday, November 17, 2011

11/17/2011

Some points on RDF:
  • In RDF, semantic meaning can be formed by considering the connections between proposition or predicate symbols.
  • Deductive databases contain tables with some background knowledge.
  • OWL provides the background knowledge in the case of RDF.
  • OWL is an XML file having its own schema, as is an RDF file.
  • With respect to the Semantic Web, people are normally unhappy with both the input and the output: the query complexity increases, and even the output returned will not be as perfect as desired since the query will be imperfect.
  • For the integration of structured data across many databases, schema mapping is done between the relationships between attributes in the databases. For this schema mapping, OWL is used.

Preethi

11/15/11

Query complexity is much higher if you allow background
knowledge.  So inference on Semantic Web which has RDF
is not linear with the size of the data.

--
Nikhil Pratap


11/17/2011

Points to Ponder:

- What is the right level of generality to knowledge?
- First order logic is monotonic.
- RDF is isomorphic to databases.
- Learning background knowledge is harder than learning instance knowledge.
- Background knowledge is small but universal.
- Boolean satisfiability is an NP-complete problem.



- Archana

11/17/2011

RDF is a standard for writing base facts in XML syntax.

- Elias

11/15/2011

Kids learn syntax before semantics: for a specific meaningless
phrase, even though they don't know the meaning, they are able to
tell that the syntax is fine.

-Ivan Zhou

11/10/2011

XML from an IR point of view should be ordered (the order of word occurrences in text data is important), whereas XML from a database point of view need not be ordered (tuples can interchange their order, and so can the columns in a table).
 
Anuj
 

11/10/2011

Structure helps in querying, and XML can be used to specify structure. Extra structure can be exploited, e.g., by giving higher weights to words which occur in header tags.
 
Anuj

11/15/11

When the same subject is being described in different languages in XML, an ontological mapping may be used to reveal semantically equivalent sections.

Andree

11/15/2011

If the data is in the form of RDF or OWL, the query language used is called SPARQL.
 
-Bharath

11/15/2011

XQuery has an expression called a FLWOR expression, which is similar to an SQL statement with SELECT, FROM, WHERE, and ORDER BY clauses and is used to query XML data. FLWOR is an acronym for: FOR, LET, WHERE, ORDER BY, RETURN.

-- Dinu

11/15/11

In XML, the difference between data and metadata
disappears: both are just tags.

As legitimate XQueries allow queries on this metadata,
converting them to SQL queries is non-trivial.

M.

11/15/2011


If we convert normal data to semi-structured data (XML), we need to define a schema.

RDF provides form to XML content.

OWL/RDF Schema are standards for writing domain knowledge in XML syntax.

-Rajasekhar

11/10/2011


To exploit more structure we need a bigger inverted index.

MS Office files are also in XML format, as XML adds structure to the content.

-Rajasekhar

11/8/2011


The competitive ratio is always <= 1.

Vickrey auction:
    The advertiser with the higher bid wins but pays the second-highest bid.
- Trust in the mechanism is important in bidding.

-Bayapu Rajasekhar 
 

11/3/2011



Content-based filtering is based on analysis of content. Popular applications such as Pandora use this approach for song recommendation. Web sites like Amazon use the collaborative filtering approach, as they recommend new products to users based on other users' purchases.

-Bayapu Rajasekhar

11/15/2011

XQuery can be used for SQL-style search and IR search (keyword search), and to rewrite data in other formats.
XQuery allows changing tags, say from <writer> to <author>.

-Sandeep Gautham

11/1/2011


Feature Selection:

Selecting the words over which we are going to learn is known as feature selection.
Collaborative filtering initially doesn't have any items, as it depends on other users' activities or transaction records (similar to query logs). It faces the cold-start problem.

-Rajasekhar

10/27/2011


- To achieve the best split between classes we can use support vector machines.

Classification:
     One of the best techniques for classification is Naive Bayes classification. We can find the probability of a particular item occurring in a class (group) or not.

-Rajasekhar

11/15/2011

XQuery allows for just about any query you can do with SQL, plus IR-type queries on XML documents.

-James Cotter

Wednesday, November 16, 2011

11/15/11

RDF is a standard for writing base facts in XML format.

It supports only 2-ary (binary) relations, but an N-ary relation can be broken down into several 2-ary relations.
-Abhishek

11/15/11

  • XML tags have no meaning.
  • Meaning comes from inter-relation. 
  • Machines don't have any background knowledge like humans. 

-Abhishek


11/15/11

  • A DTD or XML schema can only validate an XML document syntactically, i.e., that the XML is well formed.
  • They can't validate an XML document semantically.
  • A DTD is not in XML format.
- Abhishek


11/18/2011

XML from database point of view:

Unlike in IR, with XML in a database we have to worry about querying.

XQuery is a standard query language for use with XML and more
information is available from W3C.

DTD can be used to enforce schema.

XQuery is "similar" to SQL and does many/all of the things SQL can if not more.

XML schema only provides us with loose meaning. The schema's meaning
has to be decided upon.

-Thomas Hayden

11/08/2011

Performance based advertising works, it's a multi-billion dollar industry.

Competitive ratio - accuracy of online computation/optimal accuracy

Expected return - clickvalue * likelihood of click.

Diversity becomes important when showing more than one ad, for
instance you don't want to show competitor ads together.

It might be good to bias high CTR ads first when showing multiple ads,
so that users aren't turned off by lower CTR (offensive) ads.

Vickrey Auction: Winner only pays the price of the second-highest bidder. This
in practice produces truthful bidding strategies.

-Thomas Hayden

11/03/2011

The user-items relation can be thought of in many ways as similar to
the document-words relation used previously.

User-items relation allows us to do things like user-user vector space
comparisons.

Cold start - Issue of starting with no data.

Just because two users have everything in common doesn't mean
anything if everything is little; hence significance weighting.

Content Boosted - The concept of using learned info to compute info
about new content.

-Thomas Hayden

11/01/2011

The probability of a node in a Bayes network can be computed by an
expression over the nodes it depends on.

These expressions can be very large. Overflow could be avoided by
working with the log of value instead of the original.

Sample bias and 0 errors could be prevented by multiplying each
probability by 1/V, where V is some "virtual" document count.

Feature selection is important for both performance, and accuracy
reasons. Including too much information can actually lower the
"correctness" of the results.

Diversity of features can be just as important as similarity.

-Thomas Hayden

10/27/2011

Parametric VS non-parametric learning:

parametric:
size is fixed to a set of parameters

non-parametric:
size relative to size of training data


Generally training time has two costs, examples (number) and processing time.

There are a number of good text based learning machine techniques, but
naive bayes nets are a good starting point.

Naive bayesian assumes that all attributes are independent.
Information is lost, but this form is much faster and still quite
good.

Some algorithms are good at dealing with missing data and incremental
additions of data; Bayesian classifiers are one of these.

Smoothing can be done to prevent erroneous training examples from
jumping to irrational conclusions. This could be done by prepping the
training data with a virtual uniform value such as 50/100.

-Thomas Hayden

10/25/2011

Classification Learning:
Unsupervised learning (clustering) and supervised learning
(classification) are two extremes, but often techniques exist that
fall between that can work with labeled and unlabeled data.

Relevance feed back - Define relevance based on user feedback either
direct or indirect, such as clicking a link for query results.

RF can be used to rewrite the original query to improve the desired
precision and recall.

-Thomas Hayden


Tuesday, November 15, 2011

11/15/2011

For a relational database, when XML is used as a front end, the input is given as an XQuery and the output is returned in XML format.

Preethi

11/15/2011

RDF is a standard for writing base facts in XML. It allows a relation between 2 entities at a time, i.e., it allows writing only binary predicates. The background knowledge for RDF is represented using OWL.


Srividya

11/15/2011

When we convert structured data into XML, we will need a schema. A schema tells the structure of the tuple.

--Archana

11/15/2011

XML is a purely syntactic standard; it has nothing to do with semantics.

Shubhendra

11/15/2011

XML is a purely syntactic standard and not semantic.

- Elias

11/10/2011

Semi-structured data, or XML, is heavily used in the new file format for MS Office files. The difference between XML & HTML is that in HTML the tags have actual meaning to the browser.

--
Ivan Zhou
Graduate Student
Graduate Professional Student Association (GPSA) Assembly Member
Eta Kappa Nu (HKN) Active Member
School of Computing, Informatics and Decision Systems Engineering
Ira A. Fulton School of Engineering
Arizona State University

11/10/2011

XML is a language which can be used to specify the structure of text (unstructured text, like that found in normal web pages).
XML helps to give structure to raw text (the Web), and it helps to reduce the structure of relational data.

11/10/2011

XML is a standard for the semi-structured data model.

It is a bridge between structured and unstructured data,
covering the continuous spectrum from unstructured documents to structured data.


--
Nikhil

11/10/2011

XML files have structure which can be exploited to improve the precision and recall.
Example : Path Queries uses tree structure of the XML file.

- Sandeep Gautham

11/10/2011

Semi-structured data
1. organised in semantic entities
2. similar entities are grouped together
3. entities in same group may not have same attributes
4. order of attributes not necessarily important
5. not all attributes may be required
6. size and type of same attributes in a group may differ

- Archana

11-10-2011

Simply giving a matching document's URL or full structure is not usually useful as a result. Sometimes it is better to utilize the tree-like structure of the document and return matching nodes, or find common ancestors of these matching nodes, and return the subtrees rooted at these ancestors.
~Kalin Jonas

11-08-2011

A Greedy algorithm is very good at choosing which Advert a Search Engine should display. Some complications that must be considered when designing the greedy algorithm are: differing Click-Through Rates, bids on keywords rather than full queries, multiple ads per page, advertiser budgets, and untruthful bidding.
~Kalin Jonas

11/10/2011

Proximity of query terms can be analysed from the node structure of the xml

--

Avinash Bhashyam

11/08/2011

Vickrey Auction facilitates truthful bidding in case of sealed bid auctions

--

Avinash Bhashyam

11/8/2011

In the single-item second-price auction, truthful bidding is the dominant strategy. In the multi-item second-price auction, truthful bidding is not the dominant strategy; hence the single-item second-price auction can be modified for multiple items in such a way that truthful bidding is the dominant strategy, which is called the multi-item VCG auction.

Anuj

 

11/8/2011

In general, if a problem can be done both in an online and an offline way, then the optimal solution for doing it offline cannot be worse than doing it online. But this fact cannot be exploited in search advertising, as the queries cannot be predicted.

Anuj

Sunday, November 13, 2011

10/25/2011

Two classification algorithms both reaching 100% accuracy can be compared by seeing how fast they reach that accuracy (the learning curve). {Speed of learning and the amount of training data required are what matter.}

Shubhendra

Saturday, November 12, 2011

11/10/2011

XML is not an imposition, but a facilitator. The order of elements in XML matters when we look at it from the IR side, but it does not matter from the database side.

-- Dinu

Friday, November 11, 2011

11/11/2011

XML forms a bridge between unstructured (text/web) and fully structured (databases) data. The XML tags have no meaning of their own and are understood as interpreted by the code using them.
 
-Rashmi Dubey

Thursday, November 10, 2011

11/10/11

--> XML's structure offers more sophisticated proximity measures.
--> The structure of the query must be correlative; the person submitting the query can only take advantage of XML schemas they already know. Our work focuses instead on end users who are not schema experts.

-Aamir

11/8/2011

Google at first displayed ads based on the highest bids for the
search keywords. But later on it realized that the better approach was
to also include the click-through rate in its calculations in order to
increase its profits. In short, displaying a higher-priced ad might
not mean more profits.

-Ivan Zhou

11/08/11

The multi-item 2nd-price auction does not inherit the property that truthfulness is a dominant strategy.
The multi-item Vickrey-Clarke-Groves (VCG) auction solves this issue of
truthfulness by determining pricing externally. It is a more costly approach due to remapping, so most
search engines still use the multi-item generalized second-price auction.

M.

11/08/2011

Search advertising consists of 3 parts :-
Search engine
User
Advertiser

The overall utility is a function of the utilities of all three of these. The objective is to optimize this complex function such that the user's utility and the advertiser's ROI don't fall below certain values.

Optimizing this function is the search engine's responsibility.

-Rashmi Dubey.



11/08/2011

The search engine (wants more clicks), advertisers (want clicks that turn into business) and users (want good ads with useful information) all have their own utilities.
All these utilities have to be balanced to make the search engine a good advertising system.
 
There can be many optimization functions for this. One such function can be a simple linear combination expressed as
Utility = a * UtilityAdvertiser + b * UtilityUser + c * UtilitySE
where a + b + c = 1.
Here the utility for the SE may be the cost per click in dollars,
for the advertiser it may be the number of converted sales,
and for the user the relevance of the ads, i.e., whether they got some useful information.
 
-Bharath

11/08/2011

Vickrey auction is a type of sealed-bid auction, where bidders submit written bids without knowing the bid of the other people in the auction, 
and in which the highest bidder wins, but the price paid is the second-highest bid.

-Sandeep Gautham

Wednesday, November 9, 2011

11/8/2011

Competitive ratio is the ratio of online to offline returns.

For instance if an offline computation for choosing which ads to display generates an optimal income of $100, and an online computation generates $50, then the competitive ratio is 50/100=1/2.

--James Cotter

Tuesday, November 8, 2011

11/08/2011

Truthful bidding (Vickrey Auction)

In a normal auction the winning bidder has to pay the price which he/she bid. With this in mind, every bidder would try to bid a little less than what he/she thinks the item is worth. This is bad for the seller of the item, since the item might finally be sold for less than its worth. It might also be bad for the bidders, because a bidder might value it a lot more than others and might regret it later (buyer's remorse!).

In Vickrey auction the bidder with the highest bid wins but pays only the second best bid. The idea here is that all bidders will give a correct bid (true value of item according to them)  if they know that they are going to pay less than what they bid. 

--Shreejay
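
A tiny sketch of a single-item Vickrey (second-price) auction with made-up bids.

```python
# Single-item Vickrey auction sketch: the highest bidder wins but pays the
# second-highest bid.  The bids are invented for the example.
bids = {"alice": 7.50, "bob": 9.00, "carol": 6.25}

ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
winner = ranked[0][0]
price = ranked[1][1]                     # second-highest bid
print(f"{winner} wins and pays {price:.2f}")
```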

11/08/2011

CTR: Click Through Rates 

Advertisers who show their ads on search engines are billed on the number of times these ads are clicked (not on how many times these ads are shown). CTR is the ratio of the number of times an ad is clicked and the number of times an ad is shown. 

Number of clicks / Number of times ad is shown. 

In the simplest online advertising system, it can be more profitable to show an ad which has a higher CTR than the one which has the highest bid.



--Shreejay
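
A minimal sketch of that comparison: ranking ads by expected revenue per impression (bid times CTR) rather than by bid alone. The bids and click-through rates are invented.

```python
# Rank ads by expected revenue per impression (bid * CTR) instead of by bid.
ads = [("ad_A", 2.00, 0.01),   # (name, bid in $, CTR)
       ("ad_B", 1.00, 0.05),
       ("ad_C", 1.50, 0.02)]

for name, bid, ctr in sorted(ads, key=lambda a: a[1] * a[2], reverse=True):
    print(f"{name}: expected revenue per impression = {bid * ctr:.3f}")
# ad_B comes out on top despite the lowest bid, because its CTR makes it
# the most profitable ad to show.
```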

11/08/2011

Search Advertising:
It's a form of recommendation system which involves 3 parties (instead of 2): the search engine, the advertiser, and the user.

The optimization problem here is to have the least number of conflicts between these 3 parties while having the most revenue (for the SE), the most ROI (for the advertisers), and the most relevant ads and information (for the users).



--Shreejay

11/08/11

--> In the Vickrey auction, truthfully bidding the amount at which you value the object is the optimal bidding strategy.
--> In the Generalized Second Price Auction (the multi-object Vickrey mechanism), truthful bidding is not a dominant strategy. All of the i highest bidders should pay the same price: the (i+1)-st highest bid.

-Aamir

11/08/2011

In a Vickrey auction, the winning bidder pays an amount equal to the highest non-winning bid. The optimal strategy for a bidder is to reveal their true value.

-- Dinu John

11/08/2011

Competitive ratio = Online/Offline <= 1.
When a query arrives, the search engine picks an ad to be shown among the best bids.
A greedy algorithm would not be an ideal way to pick these ads from the different bidders.
 
Srividya
 

11/08/2011

The competitive ratio compares the performance of an online algorithm (which must satisfy an unpredictable sequence of requests, completing each request without being able to see the future) to the performance of an optimal offline algorithm that can view the sequence of requests in advance.
 
-Sekhar

11/08/2011

A search engine should not display an advertisement based on price per click alone, but it needs to consider the probability of being clicked as well.

- Elias

11/03/2011

Pandora is an example of a content-based system and Netflix is an
example of collaborative filtering. In search advertising, when the user clicks
on the ad, Google gets paid. And it was not until 2002 that Google went
beyond basing its ads on keyword searches alone.

-Ivan Zhou

11/03/2011

Pandora uses content-based filtering to recommend songs to users. Since the items are songs, the features or genes (melody, harmony, rhythm) of each song are described by listening to the songs. Each song is represented as a vector in the space of these features. To create a station around a song, they compute vector similarities between the songs and play the similar ones.
 
-Bharath

11/03/2011

User-Centered Collaborative Filtering:
- Uses the K nearest neighbours of the active user.
- Missing values for the active user are computed using the ratings given by its neighbours. The neighbour weighting is usually found using the Pearson correlation coefficient.
 
-Rashmi Dubey

11/01/2011

Collaborative Filtering is a special case of Scalar & Association Clusters.
It is domain-free (unlike Content Based approach which is highly domain sensitive), but has a cold-start problem when there are many null values.
 
-Rashmi Dubey.

10/27/2011

Naive Bayes Classification
 
This method assumes that each attribute of a class is independent of the other attributes, given the class.
 
It is given by the formula: P(Ci | E) = (P(E | Ci) * P(Ci)) / P(E)
where Ci is the ith class,
E is the set of attribute values A1=v1, A2=v2, ..., Ak=vk,
and v1, v2, ..., vk are the values of the attributes.
 
-Rashmi Dubey
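
A minimal sketch of that formula on a toy dataset; the examples are invented, each attribute is assumed to take one of two values, and Laplace smoothing (alpha) is added so a zero count does not wipe out the whole product.

```python
# Naive Bayes sketch: P(Ci|E) is proportional to P(Ci) * prod_k P(Ak=vk | Ci),
# assuming attributes are independent given the class.  Data is invented.
from collections import Counter, defaultdict

data = [(("sunny", "hot"), "no"), (("sunny", "mild"), "yes"),
        (("rainy", "mild"), "yes"), (("rainy", "hot"), "no"),
        (("sunny", "mild"), "yes")]

class_counts = Counter(c for _, c in data)
attr_counts = defaultdict(Counter)            # attr_counts[(slot, class)][value]
for attrs, c in data:
    for slot, value in enumerate(attrs):
        attr_counts[(slot, c)][value] += 1

def score(attrs, c, alpha=1.0, values_per_attr=2):
    p = class_counts[c] / len(data)           # P(Ci); P(E) is a shared constant
    for slot, value in enumerate(attrs):
        counts = attr_counts[(slot, c)]
        p *= (counts[value] + alpha) / (sum(counts.values()) + alpha * values_per_attr)
    return p

example = ("sunny", "mild")
scores = {c: score(example, c) for c in class_counts}
print(max(scores, key=scores.get), scores)
```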

10/25/2011

LSI is a special case of Linear Discriminant Analysis (LDA). In LDA every document is its own class.
 
K-nearest neighbour algorithm is good at handling non-spherical clusters.
 
-Rashmi Dubey.

10/25/2011

Classification is a type of supervised learning. Labelled data is available on which a classifier is trained, and later the trained model is used to predict labels for unlabelled data.
Some applications of classification:
1. Finding spam vs. non-spam mails.
2. Fraudulent text transactions.
3. Recommendation Systems.
4. Relevance Feedback.
5. Which links to crawl.
 
-Rashmi Dubey.

10/20/2011

Hierarchical Clustering is the type of clustering which creates a tree structure.
There are two type of HC:
1. Hierarchical Agglomerative Clustering :  (Going up the tree)
    -Start with K(=N) clusters where N is total number of data points.
    -At each step compute the pairwise distances between all clusters and merge the two closest clusters.
    -Repeat until K=1.
 
2. Hierarchical Divisive Clustering : (Going down the tree)
    -Start with K=1 cluster.
    -It is similar to bisecting k-means.
 
-Rashmi Dubey.
 

11/3/2011

Item-centered collaborative filtering starts with a "centered" user-item matrix; we find the k nearest items for each item (analogous to finding the k nearest users to the active user and using them to recommend unrated items).
As for search advertising, we should balance three parties (users, the search engine, and advertisers) and their differing utility models.
 
-- Shu

11/1/2011

Both MI and LSI are dimensionality reduction techniques; MI reduces dimensions by keeping a subset of the original dimensions and does feature selection w.r.t. a classification task.
Content-based recommending is based on information about the content of items rather than on other users' opinions, while collaborative filtering thinks of users as vectors in the space of items.

10/27/2011

Representations of text are very high-dimensional, so methods that sum evidence from many or all features (e.g. naïve Bayes, kNN, neural nets) tend to work better than ones that try to isolate just a few relevant features.
The Naive Bayes classifier is based on Bayes networks; it sometimes works better if the probabilities are smoothed, and we can use m-estimates to improve the probability estimates.

10/25/2011

Rocchio for Relevance Feedback: modify existing query based on relevance judgements, extract terms from relevant documents and add them to the query, re-weight the terms already in the query.
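
A small sketch of that Rocchio update; the term vectors and the alpha/beta/gamma weights are invented for illustration.

```python
# Rocchio relevance-feedback sketch: move the query vector toward relevant
# documents and slightly away from non-relevant ones.
import numpy as np

query = np.array([1.0, 0.0, 0.5])
relevant = [np.array([0.9, 0.1, 0.8]), np.array([0.8, 0.0, 0.9])]
nonrelevant = [np.array([0.1, 0.9, 0.0])]
alpha, beta, gamma = 1.0, 0.75, 0.15          # illustrative weights (assumed)

new_query = (alpha * query
             + beta * np.mean(relevant, axis=0)
             - gamma * np.mean(nonrelevant, axis=0))
new_query = np.clip(new_query, 0, None)       # negative term weights are usually dropped
print(new_query.round(3))
```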

10/20/2011

K-means clustering: for each point, put the point in the cluster to whose centroid it is closest, recompute the cluster centroids, repeat loop (until there is no change in clusters between two consecutive iterations).
Hierarchical clustering methods: divisive (bisecting k-means) and agglomerative. Buckshot clustering: combines HAC and K-Means clustering.
Clustering on text: use LSI to reduce dimensions before clustering.
-- Shu
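
A compact sketch of that k-means loop on made-up 2-D points; k, the points, and the iteration cap are all invented.

```python
# K-means sketch: assign each point to the nearest centroid, recompute the
# centroids, and repeat until the centroids stop moving.
import numpy as np

points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [8.0, 8.0], [8.2, 7.9], [7.8, 8.3]])
k = 2
centroids = points[:k].copy()                # naive initialization for the sketch

for _ in range(100):
    # assignment step: index of the closest centroid for every point
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # update step: mean of the points in each cluster
    new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(labels, centroids)
```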

10/13/2011

A good clustering is one where 
(Intra-cluster distance) the sum of distances between objects in the same cluster are minimized,
(Inter-cluster distance) while the distances between different clusters are maximized.
Purity of clustering = sum of the pure sizes of the clusters / total number of elements across clusters.
The k-means method uses an iterative approach to improve the clustering.
- Shu

Monday, November 7, 2011

11/3/2011

In search advertising, there are three main human parties involved. The users want ads that are relevant to them, and that they'll want to click. The advertisers want users to click their ads and pay the least amount possible to the search engines, and the search engines want to maximize their profits. Two metrics used in measuring these interests are click-through rates and price-per-click rates.
Kalin Jonas

11/03/2011

Collaborative filtering is not domain specific and depends on other users' preferences, while content-based filtering is domain specific and depends on the content.

--

Avinash Bhashyam

10/27/2011

Naive Bayesian classification doesn't capture correlations between attributes.

--

Avinash Bhashyam

Sunday, November 6, 2011

11/6/2011

Content-based vs. collaborative filtering
Advantages
Content-based:
Ability to recommend to users with unique tastes and to recommend unpopular items.
Collaborative filtering:
Access to the views of other users on a particular item.
Drawbacks
Content-based:
Inability to exploit the opinions of other users.
Collaborative filtering:
Cold-start problem, i.e., the inability to recommend an item that has not been previously rated.

-----Abilash


 

11/1/2011

Feature selection plays a pivotal role in the Naive Bayes classifier. Including too many features might lower performance and might also slow down the learning process.

------Abilash


11/3/2011

For collaborative filtering, the average of all the ratings of a user is used as a sort of midline to relate the rating to other users. The justification for this is that some users are harder to please and will rate things lower than other people in general.

-James Cotter

Saturday, November 5, 2011

11/03/2011

The advertisers get to pick which keywords they want their ad to be shown for.

--Dinu John

11/03/2011

The advertiser is charged only if the ad is clicked on.

-- Dinu

Friday, November 4, 2011

11/3/2011

Unlabeled data helps in learning if it belongs to the same distribution as the labeled data.

--Shaunak Shah

11/04/2011

Content-based filtering is domain sensitive, while collaborative filtering is domain free.
Collaborative filtering suffers from the cold-start problem, as we might not have the sample data to initially start off with.
User-centered collaborative filtering:
The Pearson correlation coefficient between the ratings of the active user a and another user u is given as

c_a,u = Covar(r_a, r_u) / (StdDev(r_a) * StdDev(r_u))


It is similar to the dot-product formula.


Srividya
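
A small sketch of that coefficient on two made-up rating dictionaries, computed over the co-rated items only.

```python
# Pearson correlation between the active user's ratings and another user's
# ratings, over co-rated items.  The ratings are invented.
import math

r_a = {"item1": 5, "item2": 3, "item3": 4}
r_u = {"item1": 4, "item2": 2, "item4": 5}

common = sorted(set(r_a) & set(r_u))
a = [r_a[i] for i in common]
u = [r_u[i] for i in common]
mean_a, mean_u = sum(a) / len(a), sum(u) / len(u)

cov = sum((x - mean_a) * (y - mean_u) for x, y in zip(a, u))
std_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
std_u = math.sqrt(sum((y - mean_u) ** 2 for y in u))
print("c_a,u =", cov / (std_a * std_u))
```

Centering each user's ratings around their own mean is what makes this behave like a dot product between adjusted rating vectors.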

11/03/2011

Collaborative Filtering
--------------------------------
It involves collaboration among multiple different users (it's like finding someone with the same interests as yours).
It does not know what it is suggesting.

Content-Based Filtering
--------------------------------
It's based on analysis of the content.
It does know what it is suggesting.


-Vaijan

Thursday, November 3, 2011

11/03/2011

Unlabeled examples don't always help.
They help only when they are taken from the same distribution as the labeled examples.

-Sandeep Gautham

11/03/2011

Unlabelled data will be helpful only when it is taken from the same distribution as the labelled data.

-Sekhar

11/03/2011

The problem with content based method is that it's domain specific.

- Elias

11/1/2011

Not all users have seen every film or read every book in a category. We can use content-based filtering to predict what they probably would rate those books and movies, and use these virtual results as if they were real, to get a better result using collaborative filtering. These virtual results produce results so accurate, that weighting these virtual results in some way is not even necessary.
~Kalin Jonas

11/1/2011

Recommendation systems rely heavily on feedback, which can be non-intrusive (when the user optionally gives feedback) or intrusive (when the user is asked to provide feedback). Netflix is one of the big companies that heavily make use of recommendation systems.

--
Ivan Zhou
Graduate Student
Graduate Professional Student Association (GPSA) Assembly Member
School of Computing, Informatics and Decision Systems Engineering
Ira A. Fulton School of Engineering
Arizona State University

11/1/11

Content-based recommendation systems keep track of user data and then learn from it to recommend things, which is a way to apply the previously learned classification algorithms.

Andree

11/01/2011

Naive Bayes classification makes lots of assumptions and still works well. Though the probability estimates are of low quality, the relative ordering of the class probabilities is correct, i.e., its classification decisions are good.

-Bharath

11/03/2011


To accomplish diversity in feature selection,
apply the mutual information process twice.
As normal, calculate the mutual information between each word
and the class, and sort. Then determine the next ranked word
by also computing the mutual information between the next highly
ranked word and the previously chosen word.
 
-- 
Nikhil 
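
A sketch of that two-pass idea: rank words by their mutual information with the class, then greedily prefer words with low mutual information with the words already chosen. The MI values below are invented placeholders; in practice they would be computed from word/class co-occurrence counts.

```python
# Diversity-aware feature selection sketch (invented MI values).
mi_with_class = {"cheap": 0.9, "free": 0.85, "viagra": 0.8, "meeting": 0.1}
mi_between = {("cheap", "free"): 0.7, ("cheap", "viagra"): 0.2,
              ("free", "viagra"): 0.3, ("cheap", "meeting"): 0.0,
              ("free", "meeting"): 0.0, ("viagra", "meeting"): 0.0}

def redundancy(w, chosen):
    # highest MI between candidate w and any already-chosen word
    return max((mi_between.get((w, c), mi_between.get((c, w), 0.0))
                for c in chosen), default=0.0)

chosen = []
candidates = sorted(mi_with_class, key=mi_with_class.get, reverse=True)
while candidates and len(chosen) < 3:
    # score = relevance to the class minus redundancy with what we already have
    best = max(candidates, key=lambda w: mi_with_class[w] - redundancy(w, chosen))
    chosen.append(best)
    candidates.remove(best)
print(chosen)
```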

11/1/11

To emphasize a point,
NBC with feature selection does
reduce the over-fitting problem with respect to the F-measure.
 
Reason: it is no longer fixated on irrelevant features,
as in the case of taking almost every feature.
Also: fewer features means, of course, faster calculation time.
 
M.

11/01/2011

NBC may compute the probabilities highly erroneously; what matters is not the exact probability but the relative order. There are many cases where the probabilities are wrong but the relative order is correct.

-- Dinu

10/27/2011

In NBC we learn probabilities from the data and compute a posterior probability distribution over the classes. The assumption here is that classes directly cause the attributes and there is no intermediary.

-- Dinu

11/01/2011

Feedback detection can be intrusive and non-intrusive.
Intrusive detection is in a way explicitly asking the user to rate the items.
On the other hand, in non-intrusive detection, we follow the user actions.

-Sandeep Gautham

11/1/2011

NBC makes the assumption that all the attributes are independent of each other.
This reduces the computation of probabilities from n*d^k to n*d*k,
where n is the number of classes, k is the number of attributes, and d is the number of different values each of these attributes can take.

-Sandeep Gautham

11/01/2011

Feature Selection:
1. Removing irrelevant features
2. Picking up some of the features from original space
3. Done with respect to a class. 


- Archana

Tuesday, November 1, 2011

11/1/2011

Collaborative filtering faces the problem of cold start, i.e., not having a large user population.

--Shaunak Shah

11/1/2011

To improve probability estimates and avoid numerical underflow errors, add logarithms of probabilities instead of multiplying the probabilities.

- Elias
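
A two-line illustration of why: multiplying many small probabilities underflows to zero, while summing their logarithms stays finite. The numbers are invented.

```python
# Sum log-probabilities instead of multiplying probabilities to avoid
# floating-point underflow when many small factors are involved.
import math

probs = [1e-5] * 200                        # 200 tiny per-word probabilities
product = 1.0
for p in probs:
    product *= p                            # underflows to 0.0

log_sum = sum(math.log(p) for p in probs)   # stays finite and comparable
print(product, log_sum)
```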

10/27/2011

There are times we need smoothing, such as when estimating the probability of a coin flip. If we only use the samples, we know the estimate can be wrong because we know the prior probability.

-Ivan Zhou

--
Ivan Zhou
Graduate Student
Graduate Professional Student Association (GPSA) Assembly Member
School of Computing, Informatics and Decision Systems Engineering
Ira A. Fulton School of Engineering
Arizona State University

10/27/11

In many cases, compression techniques are good learning techniques. Effectively, they parametrically summarize the data.

-James Cotter

10/27/2011

Naive Bayes makes the assumption that all attributes are independent. If this assumption were not made, it would make the computation harder: if a node has more than one parent, then the probability of that node must be calculated for each configuration of its parents.

-Bharath

10/27/2011

When trying to perform classification, we can begin by assuming that
each object should have a uniform chance of being in each category. As
we gather more and more samples, we must be sure to understand that
this is not always the case. We do this by adding "virtual samples"
which means that we pretend we have received M samples that are
uniformly distributed between the categories. As the empirical sample
size approaches and passes the size of M, this model begins to "trust"
its empirical samples more than the virtual samples.
~Kalin Jonas

Monday, October 31, 2011

10/27/2011

Naive bayes is one solution to the problem of classification, where given some data (for example emails) and some assumptions, we apply Bayes' Theorem to try and gauge the probability that it is classified in a class or not (e.g. spam or not).

Andree

Sunday, October 30, 2011

09/27/2011

An interesting point noticed while analyzing the project part 2 results:

Pages are ranked based on the following equation: W*PageRank + (1-W)*TF/IDF-similarity, where PageRank is a probability measuring the importance of a page against the entire document corpus (including irrelevant documents), and TF/IDF similarity is the similarity measure of a page within the cluster formed by the relevant documents. If W is not carefully chosen, this combination can result in some irrelevant pages getting ranked higher than relevant pages.

--bhaskar
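
A small sketch of that combination; the scores and the weight W are invented to show how a large W can let a popular but less relevant page outrank a relevant one.

```python
# Combined score sketch: W * PageRank + (1 - W) * TF/IDF similarity.
pages = {"relevant_page": {"pagerank": 0.02, "tfidf": 0.90},
         "popular_page":  {"pagerank": 0.30, "tfidf": 0.10}}
W = 0.8

for name, s in pages.items():
    combined = W * s["pagerank"] + (1 - W) * s["tfidf"]
    print(f"{name}: {combined:.3f}")
# With W = 0.8 the popular page wins (0.260 vs 0.196); a smaller W reverses that.
```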