Tweet Notes (CSE 494/598 F11): December 2011

Monday, December 5, 2011

09/27/2011

Hubs point to authorities. A hub's value is the sum of the authority scores of all the pages it points to,

and an authority's value is the sum of the hub scores of all the pages that point to it.

In PageRank and authority/hub computation, power iteration is the fastest way of calculating the primary eigenvector,

and the rate of convergence depends on the eigengap.
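
The hub/authority updates and the power iteration described above can be sketched as follows. This is a minimal illustration, not the course's reference implementation: the function name hits, the iteration count, and the tiny link graph (h1, h2, a1, a2) are assumptions made for the example.

```python
import math

def hits(links, iterations=50):
    """Power iteration for hub and authority scores.

    links maps each page to the list of pages it points to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages pointing at p.
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ())) for p in pages}
        # Hub score: sum of the authority scores of the pages p points to.
        hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
        # Normalize to unit length so the scores do not grow without bound.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical graph: h1 points to a1 and a2, h2 points only to a1.
links = {"h1": ["a1", "a2"], "h2": ["a1"]}
hub, auth = hits(links)
```

In this toy graph a1 ends up with a higher authority score than a2 because two hubs point to it, and h1 ends up as the stronger hub because it points to both authorities.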

--
Nikhil Pratap

The challenge with data integration is that the data is spread over different sources with different schemas, and a lot of time is consumed cleaning and migrating the data and integrating it into one unit.

-- Dinu John

12/1/2011

Information integration is the process of combining data from different sources and giving one view of the entire data in a federated manner.

-- Dinu John

11/29/2011

An extraction pattern can be specified as a path from the root of the DOM tree using XQuery, or we can write a regular expression to extract a particular portion of the web page.
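
Both styles of extraction pattern can be sketched in a few lines. This is only an illustration under stated assumptions: the page snippet is made up, and the path query uses the limited XPath subset of Python's xml.etree.ElementTree as a stand-in for XQuery, which the standard library does not provide.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment used only for this example.
page = ('<html><body><ul>'
        '<li class="price">$12.50</li>'
        '<li class="price">$3.99</li>'
        '</ul></body></html>')

# Pattern 1: a path from the root of the DOM tree down to the items.
root = ET.fromstring(page)
by_path = [li.text for li in root.findall("./body/ul/li")]

# Pattern 2: a regular expression over the raw HTML text.
by_regex = re.findall(r'<li class="price">\$(\d+\.\d{2})</li>', page)
```

The path version returns the full text of each item, while the regular expression captures only the numeric part of each price.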

-- Dinu John

11/29/2011

The process of extracting data from web pages is referred to as screen scraping.

-- Dinu John

11/29/2011

Extractors for a semi-structured web site are referred to as wrappers. These wrappers are custom-built and have to be modified if the structure of the web page changes.

-- Dinu John

11/29/2011

The HTML structure of the web pages belonging to a web site is specific and regular. The pages all follow a specific pattern and share a template.

-- Dinu John

10/27/2011

In parametric learning, the size of the learned representation is specified separately and is not proportional to the size of the training data.

E.g.: linear classifiers.

Non-parametric learners keep the entire training data around.

E.g.: K-NN.

The problem with K-NN is that it doesn't do well in high dimensions.
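
A minimal K-NN sketch makes the non-parametric point concrete: there is no training step, and every prediction scans the stored examples. The function name knn_classify, the Euclidean distance, and the toy 2-D data are assumptions for illustration only.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Label query by majority vote among its k nearest training examples.

    train is a list of (point, label) pairs; the whole list is kept around
    and searched at prediction time.
    """
    nearest = sorted(train, key=lambda example: math.dist(example[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training data: two well-separated groups of points.
train = [((0, 0), "neg"), ((0, 1), "neg"), ((1, 0), "neg"),
         ((5, 5), "pos"), ((5, 6), "pos"), ((6, 5), "pos")]
low = knn_classify(train, (0.5, 0.5))
high = knn_classify(train, (5.5, 5.5))
```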

--
Nikhil Pratap

10/25/2011

Relevance feedback rewrites the query such that the rewritten query

is intended to have higher precision and recall.

--
Nikhil Pratap
Graduate student
Department of Computer Science
(Arizona State University)

Sunday, December 4, 2011

11/03/2011

In content-based recommending, recommendations are based on information

about the content of items rather than on other users' opinions.

Advantages:

We can recommend to users with unique tastes.

We will be able to recommend new and unpopular items.

--
Nikhil Pratap

11/1/2011

Collaborative filtering follows these steps:

Weight all users with respect to their similarity with the active user.

Select a subset of the users (neighbors) to use as predictors.

Normalize the ratings and compute a prediction from a weighted combination of the selected neighbors' ratings.

Present items with highest predicted ratings as recommendations.
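
The four steps above can be sketched as follows. This is one possible reading, not the method prescribed in class: it assumes Pearson correlation as the similarity weight and mean-centering as the normalization, and the user names, item names, and ratings are invented for the example.

```python
import math

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def pearson(ra, rb):
    """Pearson correlation over the items both users have rated."""
    common = [i for i in ra if i in rb]
    if len(common) < 2:
        return 0.0
    ma = mean(ra[i] for i in common)
    mb = mean(rb[i] for i in common)
    num = sum((ra[i] - ma) * (rb[i] - mb) for i in common)
    den = math.sqrt(sum((ra[i] - ma) ** 2 for i in common)
                    * sum((rb[i] - mb) ** 2 for i in common))
    return num / den if den else 0.0

def recommend(ratings, active, k=2, top_n=1):
    # Step 1: weight all users by similarity with the active user.
    weights = {u: pearson(ratings[active], r)
               for u, r in ratings.items() if u != active}
    # Step 2: keep the k most similar users as neighbors.
    positive = [u for u in weights if weights[u] > 0]
    neighbors = sorted(positive, key=weights.get, reverse=True)[:k]
    # Step 3: mean-center each neighbor's rating and combine by weight.
    active_mean = mean(ratings[active].values())
    predictions = {}
    candidates = {i for u in neighbors for i in ratings[u]
                  if i not in ratings[active]}
    for item in candidates:
        raters = [u for u in neighbors if item in ratings[u]]
        norm = sum(abs(weights[u]) for u in raters)
        if norm == 0:
            continue
        offset = sum(weights[u] * (ratings[u][item] - mean(ratings[u].values()))
                     for u in raters)
        predictions[item] = active_mean + offset / norm
    # Step 4: present the items with the highest predicted ratings.
    return sorted(predictions, key=predictions.get, reverse=True)[:top_n]

# Hypothetical ratings on a 1-5 scale.
ratings = {
    "alice": {"m1": 5, "m2": 1, "m3": 4},
    "bob":   {"m1": 5, "m2": 1, "m3": 4, "m4": 5, "m5": 1},
    "carol": {"m1": 1, "m2": 5, "m3": 2, "m4": 1, "m5": 5},
}
top = recommend(ratings, "alice")
```

Here bob agrees with alice on every shared item and carol disagrees, so only bob is used as a neighbor and his favorite unseen item, m4, is recommended.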

--
Nikhil Pratap

11/1/11

Words selected in feature selection should be the most highly correlated with the class AND the least correlated with each other.

--
Mohal

10/27/2011

In non-parametric learners, we have to look at all of the training data, whereas in a parametric learner we just have to remember a learned representation, regardless of the training data size.

--
Mohal

10/27/2011

When the training data is small, the prior should have an effect, but after considering a large amount of training data, sticking with the prior is irrational.
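
A small worked example shows the prior fading as data accumulates. It assumes a Beta prior over a coin's heads probability, a modeling choice made here for illustration and not taken from the notes; the function name and the prior counts are likewise hypothetical.

```python
def posterior_mean(heads, flips, prior_heads=5, prior_tails=5):
    """Posterior mean of P(heads) under a Beta(prior_heads, prior_tails) prior."""
    return (prior_heads + heads) / (prior_heads + prior_tails + flips)

# Small sample: 3 heads in 3 flips. The prior pulls the estimate toward 0.5.
small = posterior_mean(3, 3)

# Large sample: 900 heads in 1000 flips. The data dominates the prior.
large = posterior_mean(900, 1000)
```

With 3 flips the estimate is 8/13, far from the raw frequency of 1.0; with 1000 flips it is 905/1010, close to the raw frequency of 0.9.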

--
Mohal

10/25/2011

Finding the K nearest neighbors is the same as retrieving the K closest documents.

--
Mohal

10/25/2011

Since classes may not form sphere-like clusters, distance-based methods do not always prove to be as useful.

--
Mohal

10/20/2011

When using vector similarity as the distance measure, the average pairwise similarity between clusters is the cosine of the clusters' centroids.

--
Mohal

10/20/2011

High-dimensional vectors are likely to be orthogonal. The similarity in this case is 0. To avoid this, the clustering should be done AFTER performing LSI and reducing the dimensions.
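
The orthogonality problem is easy to see with sparse term vectors. The sketch below only demonstrates the zero similarity; it does not perform LSI. The documents and term weights are made up for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical documents about the same topic with no terms in common.
d1 = {"car": 1.0, "engine": 2.0}
d2 = {"automobile": 1.0, "motor": 1.0}
# A third document that shares one term with d1.
d3 = {"car": 1.0, "motor": 1.0}

no_overlap = cosine(d1, d2)
some_overlap = cosine(d1, d3)
```

Although d1 and d2 are about the same topic, they share no terms, so their similarity is exactly 0 in the original term space.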

--
Mohal

10/14/2011

K-means tries to find spherical clusters even if the original data is not spread in that fashion.

--
Mohal

10/14/2011

Increasing the purity of clusters by increasing the number of clusters does not work, since one can trivially put each element in its own cluster.
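
A short sketch shows why purity alone is misleading. It assumes purity is defined as the fraction of elements that belong to the majority class of their cluster; the document ids and labels are invented.

```python
from collections import Counter

def purity(clusters, labels):
    """Fraction of elements that carry the majority label of their cluster."""
    total = sum(len(cluster) for cluster in clusters)
    majority = sum(Counter(labels[x] for x in cluster).most_common(1)[0][1]
                   for cluster in clusters)
    return majority / total

# Hypothetical gold labels for four documents.
labels = {"d1": "sports", "d2": "sports", "d3": "politics", "d4": "politics"}

mixed = purity([["d1", "d3"], ["d2", "d4"]], labels)
singletons = purity([["d1"], ["d2"], ["d3"], ["d4"]], labels)
```

The two mixed clusters score 0.5, while four singleton clusters score a perfect 1.0 without grouping anything.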

--
Mohal

12/1/2011

Information integration is combining information gathered from multiple places that might have multiple different schemas. Sabre is a database, used by all airlines except Southwest, into which all flight information is put, so other websites such as Kayak and Expedia can retrieve it and display information about multiple airlines to consumers at the same time.

--
Ivan Zhou

11/22/2011

An extraction model could do things such as segmentation,
classification, clustering, and association.

--
Nikhil Pratap

12/1/11

Various models of information integration (II):

Search (minimal collation/integration)

Mediator (acts like a broker)

Warehouse (holds lots of data but inadvisable when data is rapidly changing)

11/29/2011

Maximum likelihood, the most likely state sequence, and the observation likelihood are three important aspects of the HMM model in information extraction.

-------Abilash

11/29/2011

PMI (pointwise mutual information) is used for assessing the accuracy of extracted facts based on search-engine hit counts.

--

Avinash Bhashyam

11/22/2011

An interesting aspect to ponder in information extraction is that web pages are not random but are in fact written by humans, thereby making it possible to understand their structure and content.

--------Abilash

11/22/2011

Information extraction from the web doesn't require full NLP because there is significant regularity on the web. Some structure can be found in template-driven pages, and wrappers can extract information using that structure.

--

Avinash Bhashyam

11/29/2011

In supervised learning the links are explicitly given, as the probability is smoothed and thus all links would have a nonzero probability of occurrence.

--
Srijith Ravikumar

11/08/2011

The competitive ratio is the ratio of the performance of an online algorithm to that of the optimal offline algorithm.

--Shaunak Shah

11/17/2011

The RDF standard can be used to write base facts in XML.

--Shaunak Shah

Saturday, December 3, 2011

11/17/2011

Schema mapping specifies the relations between the attributes of different schemas; this can be written in OWL.

Anuj

11/17/2011

Boolean satisfiability is known to be NP-complete, because given a set of boolean constraints it is hard to find a variable assignment that satisfies all of them.

In general, 1-SAT is easy to solve, 2-SAT is polynomial, and 3-SAT is NP-complete.

Anuj

Friday, December 2, 2011

12/1/2011

A major issue with data integration is that many different people make different database schemas to store the same data (i.e., across the web, businesses, etc.). Sometimes the schemas map one to one to each other, but more often they map in a non-trivial manner and may not even store all the same information.

-James Cotter

11/29/11

When evaluating the most likely state sequence, note:

The most likely label of a single word may be different from its label in the most likely sequence.
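
A tiny brute-force example makes this concrete. The two-state model below is entirely hypothetical (states A and B, words x and y, and all probabilities are chosen for illustration), and it enumerates every state sequence instead of running Viterbi or forward-backward, which is only practical for toy inputs.

```python
from itertools import product

# Hypothetical HMM parameters.
states = ["A", "B"]
start = {"A": 0.6, "B": 0.4}
trans = {"A": {"A": 0.5, "B": 0.5}, "B": {"A": 0.0, "B": 1.0}}
emit = {"A": {"x": 0.5, "y": 0.5}, "B": {"x": 0.5, "y": 0.5}}
words = ["x", "y"]

def joint(seq, words):
    """Joint probability of a state sequence and the observed words."""
    p = start[seq[0]] * emit[seq[0]][words[0]]
    for prev, cur, word in zip(seq, seq[1:], words[1:]):
        p *= trans[prev][cur] * emit[cur][word]
    return p

def best_sequence(words):
    """Most likely state sequence as a whole."""
    return max(product(states, repeat=len(words)),
               key=lambda seq: joint(seq, words))

def best_label_per_word(words):
    """Most likely label for each word taken on its own."""
    labels = []
    for i in range(len(words)):
        totals = {s: 0.0 for s in states}
        for seq in product(states, repeat=len(words)):
            totals[seq[i]] += joint(seq, words)
        labels.append(max(totals, key=totals.get))
    return tuple(labels)

sequence = best_sequence(words)
per_word = best_label_per_word(words)
```

The best whole sequence is (B, B), yet the first word on its own is more likely to be labeled A, because the probability mass for A is split across two weaker sequences, (A, A) and (A, B).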