Naive Bayes is one solution to the problem of classification: given some data (for example, emails) and some independence assumptions, we apply Bayes' Theorem to estimate the probability that an item belongs to a given class (e.g., spam or not spam).
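A minimal multinomial Naive Bayes sketch in Python, with Laplace smoothing; the tiny spam/ham corpus and all names are illustrative, not from class:

```python
import math
from collections import Counter

def train(docs):
    """docs: list of (word list, label). Returns class priors and per-class word counts."""
    priors = Counter(label for _, label in docs)
    counts = {label: Counter() for label in priors}
    for words, label in docs:
        counts[label].update(words)
    return priors, counts

def classify(words, priors, counts, alpha=1.0):
    """Pick the class maximizing log P(c) + sum_w log P(w|c), with Laplace smoothing."""
    vocab = {w for c in counts.values() for w in c}
    n_docs = sum(priors.values())
    def score(label):
        total = sum(counts[label].values())
        s = math.log(priors[label] / n_docs)
        for w in words:
            s += math.log((counts[label][w] + alpha) / (total + alpha * len(vocab)))
        return s
    return max(priors, key=score)

docs = [(["win", "money", "now"], "spam"),
        (["meeting", "tomorrow"], "ham"),
        (["win", "prize", "money"], "spam"),
        (["project", "meeting", "notes"], "ham")]
priors, counts = train(docs)
print(classify(["win", "money"], priors, counts))  # spam
```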

## Monday, October 31, 2011

## Sunday, October 30, 2011

### 27/09/2011

Interesting point noticed while analyzing project part 2 results.

Pages are expected to be ranked by the equation W*PageRank + (1-W)*TF-IDF similarity, where PageRank measures the importance of a page over the entire document corpus (including irrelevant documents), and TF-IDF similarity measures how similar a page is to the cluster formed by the relevant documents. If W is not defined carefully, this combination can rank some irrelevant pages above relevant ones.

--bhaskar
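The sensitivity to W can be sketched in a few lines of Python. The PageRank and TF-IDF similarity values below are made up purely to show how a popular but off-topic page can outrank a relevant one when W is large:

```python
# hypothetical scores: (page, PageRank, TF-IDF similarity to the relevant cluster)
pages = [("popular-but-offtopic", 0.9, 0.1),
         ("relevant-page", 0.2, 0.9)]

def rank(pages, w):
    """Order pages by W*PageRank + (1-W)*TF-IDF similarity, best first."""
    return sorted(pages, key=lambda p: w * p[1] + (1 - w) * p[2], reverse=True)

print(rank(pages, 0.8)[0][0])  # popular-but-offtopic wins when W is large
print(rank(pages, 0.3)[0][0])  # relevant-page wins when W is small
```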


## Thursday, October 27, 2011

### 10/27/2011

Neural nets consume a lot of training time, i.e., the time required to go from training data to learned expressions.

Preethi

### 10/27/2011

Parametric learners have a fixed set of parameters that does not grow with the training data; the parameters do not depend on the size of the data. Most learners are parametric.

Non-parametric learners keep the entire training data around, e.g., k-NN.

Preethi
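As a sketch of a non-parametric learner that keeps all training data around, here is a minimal k-NN classifier in Python; the 2D points and labels are toy data:

```python
import math
from collections import Counter

def knn_classify(point, training, k=3):
    """training: list of (vector, label) pairs kept in full (non-parametric).
    Classify by majority vote among the k nearest training points."""
    nearest = sorted(training, key=lambda t: math.dist(point, t[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

training = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
            ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]
print(knn_classify((1, 1), training))  # A
```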

### 10/25/2011

The general assumption in clustering is that clusters are spherical, but in reality that is often not the case. To deal with irregular clusters, or clusters within clusters, the best bet is to perform clustering using a nearest-neighbor algorithm.

----Abilash

### 10/25/2011

Getting feedback from users on the search results proves very helpful when computing the next results, and it can be a way to personalize them. A problem we sometimes face with clustering is labeling the clusters.


-Ivan Zhou

### 10/25/2011

A good learning algorithm should be able to handle noise in the training data.

Relevance feedback rewrites the query so that the rewritten query has higher precision and recall.

-rajshekar
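One standard way to do this rewriting is the Rocchio formula, q' = alpha*q + beta*mean(relevant) - gamma*mean(irrelevant). A sketch in Python; the term-weight vectors for the query and feedback documents are hypothetical:

```python
def rocchio(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """q' = alpha*q + beta*mean(relevant) - gamma*mean(irrelevant).
    Vectors are term -> weight dicts; negative weights are clipped to zero."""
    terms = set(query)
    for d in relevant + irrelevant:
        terms |= set(d)
    new_q = {}
    for t in terms:
        w = (alpha * query.get(t, 0.0)
             + beta * sum(d.get(t, 0.0) for d in relevant) / max(len(relevant), 1)
             - gamma * sum(d.get(t, 0.0) for d in irrelevant) / max(len(irrelevant), 1))
        new_q[t] = max(w, 0.0)
    return new_q

# hypothetical feedback: one relevant and one irrelevant document
q = {"jaguar": 1.0}
rel = [{"jaguar": 0.5, "car": 0.8}]
irr = [{"jaguar": 0.4, "animal": 0.9}]
print(rocchio(q, rel, irr))  # "car" gains weight, "animal" is suppressed
```

Raising beta pulls the query toward the relevant documents (recall), while raising gamma pushes it away from the irrelevant ones (precision), matching the notes below.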

### 10/25/11

Rocchio classification for text data uses existing data clusters (i.e., training data) and then puts a new doc in its closest cluster (nearest centroid). But it does not change the centroid as is done in k-means.

M.
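A minimal nearest-centroid sketch of this in Python; the two-class training vectors are toy data, and note the centroids are computed once and never updated:

```python
import math

def centroid(vectors):
    """Mean of a list of equal-length tuples."""
    return tuple(sum(x) / len(vectors) for x in zip(*vectors))

def rocchio_classify(doc, training):
    """training: class label -> list of document vectors. The new doc goes to
    the class with the nearest centroid; centroids are not updated afterwards."""
    centroids = {label: centroid(vs) for label, vs in training.items()}
    return min(centroids, key=lambda label: math.dist(doc, centroids[label]))

# toy two-class training data (hypothetical term-weight vectors)
training = {"sports": [(1.0, 0.1), (0.9, 0.2)],
            "politics": [(0.1, 1.0), (0.2, 0.8)]}
print(rocchio_classify((0.8, 0.3), training))  # sports
```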

### 10/25/2011

When LSI is done on classification problems, we pick dimensions that maximally differentiate across classes while having minimal variance within any given class. This method is called LDA (Linear Discriminant Analysis).

-Sandeep Gautham


### 10/25/2011

Relevance feedback can be used by reconstructing the query, adding the knowledge learned from the user feedback to the initial query.

--

Avinash Bhashyam


### 10/25/11

Relevance feedback calculation is an active way to get relevant documents, by treating the documents the user navigates to as new training examples, in contrast to passive calculation where the training examples are all predefined at the start.

Andree

### 10/25/2011

Semisupervised learning is a combination of both supervised and unsupervised learning. Since most applications require at least some input from the user on what the categories should be, but the machine is expected to discover categories as well, they fall under this category rather than one of the two extremes it is composed of.

~Kalin Jonas


## Wednesday, October 26, 2011

### 10/25/2011

Using clustering techniques for classification is not enough because it doesn't support non-spherical data clusters very well.

-Stephen Booher


### 10/25/2011

Relevance feedback is a process where the user identifies some relevant and irrelevant documents and the system creates an expanded query by extracting additional terms from the sample relevant and irrelevant documents.

-- Dinu


### 10/20/2011

The quality of a cluster is a function of intra-cluster tightness and inter-cluster distance. The aim is to make each cluster tight and the clusters far away from each other.

In k-means we only consider intra-cluster tightness, because K is constant and if every cluster becomes tight, the clusters also end up far from each other.

-- Dinu


## Tuesday, October 25, 2011

### 10/25/2011

Classical learning can be broadly classified into two types:

1. Unsupervised learning - Clustering

2. Supervised learning - Classification

-Sandeep Gautham

### 10/25/2011

Pseudo relevance feedback assumes only the top k documents to be relevant.

In Rocchio feedback, changing beta changes the recall and changing gamma changes the precision.

Srividya

### 10/25/11

In the Rocchio classification formula, changing beta changes the recall while changing gamma changes the precision.

-Vaibhav Jain

### 10/25/2011

What is the biggest problem in applying classification techniques in the web?

Getting labeled data is the problem. Getting people to do the labeling without letting them know is the hard part. Also, this could end up producing noisy or wrong data.

- Archana

### 10/25/2011

Euclidean distance to the centroid of a class can be misleading when classifying a new object, particularly if a class lies in the middle of another class or the class is disjoint (take the example in class of coastal cities). In such cases, nearest neighbors may be a better method for choosing the correct class for an incoming object.

--James Cotter


### 10/25/2011

Most of the current search engines attempt to do Query Rewriting by User Profiling

--Shaunak Shah

### 10/20/2011

Quality of clustering can be calculated using two measures:

- intra-cluster tightness
- inter-cluster distance

In k-means we only consider intra-cluster tightness. The rate of change of the dissimilarity measure is considered for choosing k.

-rajasekhar
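The "rate of change of dissimilarity" rule for choosing k can be sketched as follows. The tightness curve is hypothetical, and comparing successive drops is just one plausible way to formalize the kink:

```python
def pick_k(dissimilarity):
    """dissimilarity: {k: total intra-cluster dissimilarity after clustering
    with k clusters}. Pick the k after which the improvement rate drops most."""
    ks = sorted(dissimilarity)
    # improvement gained by going from the previous k to this one
    drop = {k2: dissimilarity[k1] - dissimilarity[k2] for k1, k2 in zip(ks, ks[1:])}
    best_k, best_ratio = ks[0], 0.0
    for k, nxt in zip(ks[1:], ks[2:]):
        ratio = drop[k] / max(drop[nxt], 1e-12)  # kink: big drop, then a small one
        if ratio > best_ratio:
            best_k, best_ratio = k, ratio
    return best_k

# hypothetical tightness curve: big gains up to k=3, little after
curve = {1: 100.0, 2: 55.0, 3: 30.0, 4: 27.0, 5: 25.0}
print(pick_k(curve))  # 3
```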

### 10/20/2011

R-tree clustering solves many of the issues with k-means:

1) no longer need to supply the number of clusters up front;

2) it can consider multiple pairs of initial seeds.

-James Cotter


### 10/20/2011

K-means functions through intra-cluster tightness. It can produce suboptimal results if a poor seed is chosen, because k-means can get stuck in a local minimum. Another problem with k-means is that you need to define the number of clusters. One way to determine a suitable K is to compute the change in intra-cluster dissimilarity as K increases and find the point where the change in dissimilarity begins to slow. Another technique is hierarchical clustering, which can be done through division or agglomeration of clusters at each level of clustering.

-Thomas Hayden
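One common mitigation of the seed-sensitivity problem (a standard trick, not necessarily what was covered in class) is to run k-means several times with different random seeds and keep the tightest result. A sketch in Python with toy 2D points:

```python
import math
import random

def kmeans(points, k, iters=50):
    """Plain k-means: assign each point to its nearest center, then recompute
    centers as cluster means. The result depends on the random initial seeds."""
    centers = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        centers = [tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    # intra-cluster tightness: total distance of points to their nearest center
    tightness = sum(min(math.dist(p, c) for c in centers) for p in points)
    return centers, tightness

def kmeans_restarts(points, k, restarts=10):
    """Keep the tightest clustering over several random restarts."""
    return min((kmeans(points, k) for _ in range(restarts)), key=lambda r: r[1])

pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centers, tightness = kmeans_restarts(pts, 2)
print(sorted(centers), round(tightness, 2))
```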


### 10/13/2011

Clustering is sometimes called unsupervised learning. Single link: a point can join a cluster if it is near any member ("a node near me was allowed in, so should I be"). Complete link: a point can join only if it fits with all members ("I can't let another in unless it is close to every node"). The purity-of-clusters problem: we could specify a number of clusters equal to the number of elements, and this would give us perfect purity! Purity can only work if we fix the number of clusters. Creating an optimal clustering is not always simple due to the large number of permutations; instead, we create an intuitive clustering and then iteratively improve it.

- Thomas Hayden


### 10/20/2011

Assuming in k-means that clusters are spherical in vector space leads to sensitivity to coordinate changes and weighting.

--

Avinash Bhashyam


### 10/20/2011

Users need snippets to determine the better cluster. We can make snippets using the most frequently occurring words: use word frequency in the cluster, but we still need an inverse cluster frequency value (to eliminate words that are common in all clusters). Cluster snippet = top k words with the highest tf*icf values from the cluster.

Nikhil Pratap
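A sketch of the tf*icf snippet idea in Python; `cluster_snippets`, the icf formula log(#clusters / #clusters containing the word), and the toy clusters are all illustrative assumptions:

```python
import math
from collections import Counter

def cluster_snippets(clusters, k=3):
    """clusters: {cluster name: list of tokenized docs}. Snippet = top-k words
    by tf * icf, where icf = log(#clusters / #clusters containing the word)."""
    tf = {name: Counter(w for doc in docs for w in doc)
          for name, docs in clusters.items()}
    cluster_freq = Counter()
    for counts in tf.values():
        cluster_freq.update(counts.keys())
    n = len(clusters)
    snippets = {}
    for name, counts in tf.items():
        ranked = sorted(counts, key=lambda w: -counts[w] * math.log(n / cluster_freq[w]))
        snippets[name] = ranked[:k]
    return snippets

# toy clusters; "the" occurs in every cluster, so icf drives it to the bottom
clusters = {"autos": [["engine", "car", "the"], ["car", "wheel", "the"]],
            "cooking": [["recipe", "oven", "the"], ["oven", "spice", "the"]]}
print(cluster_snippets(clusters, k=2))
```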

### 10/20/11

Clustering on high-dimensional data like documents is tricky, as most document pairs have a similarity approaching 0. The cosine distance is not really a good measure, so use the costly LSI to reduce dimensions; the true distance is then represented in the reduced-dimension space. Manjara did this, but it's not practical. Yet.

M.

### 10/20/2011

The bisecting k-means algorithm solves two issues of k-means:

1. No need to know the value of K at the beginning.

2. It addresses the local-minima problem by considering multiple candidate pairs when identifying each initial cluster split.

-Bharath
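A minimal sketch of bisecting k-means in Python: repeatedly split the largest cluster with 2-means until k clusters exist. The toy data and the "split the largest cluster" policy are illustrative choices:

```python
import math
import random

def two_means(points, iters=20):
    """Split a cluster in two with plain 2-means."""
    centers = random.sample(points, 2)
    for _ in range(iters):
        halves = ([], [])
        for p in points:
            closer = 0 if math.dist(p, centers[0]) <= math.dist(p, centers[1]) else 1
            halves[closer].append(p)
        centers = [tuple(sum(x) / len(h) for x in zip(*h)) if h else centers[i]
                   for i, h in enumerate(halves)]
    return [list(h) for h in halves if h]

def bisecting_kmeans(points, k):
    """Repeatedly bisect the largest cluster until there are k clusters, so K
    need not drive the initial seeding of all clusters at once."""
    clusters = [list(points)]
    while len(clusters) < k:
        clusters.sort(key=len, reverse=True)
        biggest = clusters.pop(0)
        parts = two_means(biggest)
        if len(parts) < 2:  # degenerate split; stop rather than loop forever
            clusters.append(biggest)
            break
        clusters.extend(parts)
    return clusters

random.seed(1)
pts = [(0, 0), (0, 1), (10, 0), (10, 1), (20, 0), (20, 1)]
print([len(c) for c in bisecting_kmeans(pts, 3)])
```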

### 10/20/2011

Although agglomerative hierarchical clustering (AHC) can produce an ordering of objects, one of its major drawbacks is that if an object is initially placed in the wrong group, there is no provision for relocating it later.

-----Abilash

### 10/20/2011

K-means clustering is disjoint and exhaustive, which makes outliers a problem: one data point cannot be part of two clusters.

-Archana

## Sunday, October 23, 2011

### 10/20/2011

In k-means, each cluster is represented by its mean, and the objective is a dissimilarity measure. One problem it faces is that as K increases, dissimilarity falls; this could push K up to the number of points/nodes in the entire problem. One solution is sampling instead of using everything (for example, 3 trillion books): sample with different values of K and use the best. Another solution is to penalize increases in the number of clusters.

--

Ivan Zhou


### 10/20/2011

K-means problems:

1. K, the number of clusters, needs to be specified upfront.

2. K-means is sensitive to the initial seeds picked.

Advantages of k-means: very fast, linear in all the relevant factors.

Ramya


## Saturday, October 22, 2011

### 10/20/11

Reiterating the point raised in class.

If the points on a 2D plane are evenly distributed to form a circle, then bisecting (2-means divisive) hierarchical clustering will divide the points at a diameter of the circle. After iterating this step n times, we will have 2^n clusters of equal size. If I cut the hierarchy to get an odd number of clusters, then the resulting clusters will be of unequal sizes.

--bhaskar


## Friday, October 21, 2011

### 10/20/2011

The internal measure of clustering depends upon

- Intra cluster tightness.
- Inter cluster separation.

Preethi

### 10/20/2011

The disadvantage of the k-means clustering algorithm is that the value of K must be known in advance, and the disadvantage of hierarchical clustering is that it consumes a lot of time (O(n^2)). So another algorithm, the Buckshot algorithm, is used as a hybrid of the two, addressing both problems.

Preethi
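A sketch of the Buckshot idea in Python: run a naive agglomerative clustering on a sample of about sqrt(k*n) points to get good seeds, then run ordinary k-means from those seeds over the full data. The helper names, sample-size rounding, and toy points are illustrative assumptions:

```python
import math
import random

def centroid(cluster):
    return tuple(sum(x) / len(cluster) for x in zip(*cluster))

def hac_seeds(sample, k):
    """Naive agglomerative clustering (centroid linkage) down to k clusters;
    returns the k centroids to use as k-means seeds."""
    clusters = [[p] for p in sample]
    while len(clusters) > k:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: math.dist(centroid(clusters[ij[0]]),
                                            centroid(clusters[ij[1]])))
        clusters[i] += clusters.pop(j)
    return [centroid(c) for c in clusters]

def buckshot(points, k, iters=20):
    """Buckshot: HAC on a ~sqrt(k*n) sample picks the seeds, then plain
    k-means refines them over the full dataset."""
    sample = random.sample(points, max(k, round(math.sqrt(k * len(points)))))
    centers = hac_seeds(sample, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda i: math.dist(p, centers[i]))].append(p)
        centers = [centroid(g) if g else c for c, g in zip(centers, groups)]
    return centers

random.seed(2)
pts = [(0, 0), (0, 1), (1, 0), (8, 8), (8, 9), (9, 8)]
print(sorted(buckshot(pts, 2)))
```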

## Thursday, October 20, 2011

### 10/20/2011

One way to address the problem of not knowing k in advance of running k-means is to look for kinks in the tightness vs. K curve. However, it might not be practical to run k-means for so many different k's on large datasets. So one way to improve this is to take a random sample of the dataset beforehand, running k-means with various sizes of k to find the "best" k.


- Stephen Booher

### 10/20/2011

In k-means, the seeds you start with may not reach the global minimum of the goodness measure.

- Elias

### 10/13/11

Any clustering method has an internal bias to find a particular shape of cluster. K-means is looking for spheres (based on its distance measure), so it will find spheres. Consequently, if you have a chain of data, k-means may not be the best clustering method. Fortunately, it is a pretty good methodology for text documents.

M.

### 10/13/2011

In a neighborhood approach, an outlier can be defined as a point that has fewer than d points within distance x, where d and x are predefined thresholds. Outliers are removed in a preprocessing step.

-----Abilash

## Wednesday, October 19, 2011

### 10/13/2011

Purity of clusters is the sum of pure sizes divided by the total number of elements in the clusters. The pure size of a cluster is the size of the largest class within it. Purity is a simple measure to evaluate a clustering.


-Stephen Booher
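The purity computation can be written in a few lines of Python; the label lists are toy data:

```python
from collections import Counter

def purity(clusters):
    """clusters: list of lists of true class labels.
    purity = sum over clusters of the majority-class size / total elements."""
    total = sum(len(c) for c in clusters)
    pure = sum(Counter(c).most_common(1)[0][1] for c in clusters)
    return pure / total

clusters = [["x", "x", "y"], ["y", "y", "y", "x"], ["z", "z"]]
print(purity(clusters))  # (2 + 3 + 2) / 9 ≈ 0.78
```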

## Tuesday, October 18, 2011

### 10/13/2011

Regarding cluster purity, its computation doesn't work as well as the number of clusters increases: if the number of clusters reaches the number of elements, every cluster has a purity of 1. Hard clustering is a complete assignment of each member to one cluster; soft clustering is a partial assignment across clusters.

--

Ivan Zhou

