Tweet Notes (CSE 494/598 F11): October 2011

Monday, October 31, 2011

10/27/2011

Naive Bayes is one solution to the problem of classification: given some data (for example, emails) and some assumptions, we apply Bayes' Theorem to estimate the probability that an item belongs to a class or not (e.g. spam or not).

Andree
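
The idea in the note above can be sketched in a few lines of Python. This is only a minimal illustration, not the course's implementation: the function names (train_nb, nb_classify), the tiny training set, and the use of add-one (Laplace) smoothing are all assumptions made for the example.

```python
import math
from collections import Counter

def train_nb(docs):
    """docs: list of (list_of_words, label). Returns (class counts, word counts, vocabulary)."""
    class_counts = Counter(label for _, label in docs)
    word_counts = {label: Counter() for label in class_counts}
    for words, label in docs:
        word_counts[label].update(words)
    vocab = {w for words, _ in docs for w in words}
    return class_counts, word_counts, vocab

def nb_classify(words, class_counts, word_counts, vocab):
    """Pick the class with the highest log posterior (up to a shared constant)."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior P(class)
        total_words = sum(word_counts[label].values())
        for w in words:
            if w not in vocab:
                continue  # ignore words never seen in training
            # Assumption: add-one smoothing so unseen (word, class) pairs are not zero
            p = (word_counts[label][w] + 1) / (total_words + len(vocab))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training data
training = [
    ("win money now".split(), "spam"),
    ("free money offer".split(), "spam"),
    ("meeting at noon".split(), "ham"),
    ("project meeting notes".split(), "ham"),
]
model = train_nb(training)
print(nb_classify("free money".split(), *model))
print(nb_classify("project notes".split(), *model))
```

The "naive" assumption shows up in the inner loop: each word contributes its own independent factor to the class score.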

Sunday, October 30, 2011

27/09/2011

Interesting point noticed while analyzing project part 2 results.

Pages are expected to be ranked by the following equation: W*PageRank + (1-W)*TF/IDF similarity, where PageRank is the probability of importance of a page measured against the entire document corpus (including irrelevant documents), and TF/IDF similarity is a similarity measure of a page computed within the cluster formed by the relevant documents. If W is not chosen carefully, combining the two can rank some irrelevant pages above relevant pages.

--bhaskar
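
A small sketch of the effect described above. The two pages and their scores are made up for illustration, and both scores are assumed to be already normalized to the range 0 to 1; combined_score and rank_pages are hypothetical names.

```python
def combined_score(pagerank, similarity, w):
    """W*PageRank + (1-W)*similarity, as in the equation in the note."""
    return w * pagerank + (1 - w) * similarity

def rank_pages(pages, w):
    """Return page names sorted from highest to lowest combined score."""
    return sorted(
        pages,
        key=lambda name: combined_score(pages[name]["pagerank"], pages[name]["similarity"], w),
        reverse=True,
    )

# Hypothetical scores: one on-topic page, one popular but off-topic page
pages = {
    "relevant_page": {"pagerank": 0.2, "similarity": 0.9},
    "popular_offtopic_page": {"pagerank": 0.9, "similarity": 0.1},
}

print(rank_pages(pages, 0.2))  # low W: similarity dominates
print(rank_pages(pages, 0.8))  # high W: PageRank dominates
```

With these numbers, a high W lets the popular but off-topic page overtake the relevant one, which is the risk the note points out.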

Thursday, October 27, 2011

10/27/2011

Neural nets consume a lot of training time, i.e. the time required to go from training data to learned expressions.

Preethi

10/27/2011

For parametric learners, the size of the learned model is not proportional to the training data: the number of parameters does not depend upon the data. Most learners are parametric.

Non-parametric learners keep the entire training data around, e.g. k-NN.

Preethi

10/27/2011

A Support Vector Machine provides the best possible split between classes: the separating boundary with the largest margin.

-----Abilash

10/25/2011

The general assumption when clustering is that clusters are spherical in shape, but in reality that is often not the case. To deal with irregular clusters, such as clusters within clusters, the best bet is to cluster using a nearest neighbor algorithm.

----Abilash

10/25/2011

Getting feedback from users on the search results proves very helpful
when computing the next results and it can be a way to personalize
them. And the problem we face with clustering sometimes is labeling
them.

-Ivan Zhou

10/25/2011

A good learning algorithm should be able to handle the noise in the training data.

Relevance feedback rewrites the query so that the results returned for the rewritten query have higher precision and recall.

-rajshekar

10/25/11

Rocchio Classification for text data uses existing data clusters (i.e. training data) and then puts a new doc in its closest cluster (nearest centroid). But it does not change the centroid as is done in k-means.
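
A minimal sketch of nearest-centroid (Rocchio) classification as described above. The three-term vectors, the class names, and the choice of cosine similarity as the closeness measure are assumptions for the example; rocchio_train and rocchio_classify are hypothetical names.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rocchio_train(labeled):
    """labeled: list of (vector, label). One centroid per class, computed once."""
    by_class = {}
    for vec, label in labeled:
        by_class.setdefault(label, []).append(vec)
    return {label: centroid(vecs) for label, vecs in by_class.items()}

def rocchio_classify(vec, centroids):
    """Assign the new doc to the class with the most similar centroid.
    The centroids are not updated, unlike in k-means."""
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

# Hypothetical term-count vectors over a 3-term vocabulary
training = [
    ([3, 0, 1], "sports"), ([2, 1, 0], "sports"),
    ([0, 3, 1], "politics"), ([1, 2, 0], "politics"),
]
centroids = rocchio_train(training)
print(rocchio_classify([2, 0, 0], centroids))
```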

10/25/2011

When LSI is done on classification problems, we pick dimensions that maximally differentiate across classes,
while having minimal variance within any given class. This method is called LDA (Linear Discriminant Analysis).

-Sandeep Gautham

10/25/2011

Relevance feedback can be used to reconstruct the query by adding the knowledge learned from the user feedback to the initial query.

--

Avinash Bhashyam

10/25/11

Relevance feedback calculation is an active way to get relevant documents: it treats the documents the user navigates to as new training examples, in contrast to passive calculation, where the training examples are all predefined at the start.

Andree

10/25/2011

Semisupervised learning is a combination of both supervised and
unsupervised learning. Since most applications require at least some
input from the user on what the categories should be, but the machine
is expected to discover categories as well, they fall under this
category rather than one of the two extremes it is composed of.
~Kalin Jonas

Wednesday, October 26, 2011

10/25/2011

Using clustering techniques for classification is not enough because
it doesn't support non-spherical data clusters very well.

-Stephen Booher


10/25/2011

Relevance feedback is a process where the user identifies some relevant and irrelevant documents and the system creates an expanded query by extracting additional terms from the sample relevant and irrelevant documents.

-- Dinu

10/20/2011

The quality of a clustering is a function of intra-cluster tightness and inter-cluster distance. The aim is to make each cluster tight and to keep the clusters far away from each other.

In k-means we only consider intra-cluster tightness, because K is constant: if every cluster becomes tight, the clusters also end up far from each other.

-- Dinu

Tuesday, October 25, 2011

10/25/2011

Classical learning can be broadly classified into 2 types:

1. Unsupervised learning - Clustering

2. Supervised learning - Classification

-Sandeep Gautham

10/25/2011

Pseudo relevance feedback assumes that the top k documents are relevant, without asking the user.

In Rocchio feedback, changing Beta changes the recall and gamma changes the precision.

Srividya
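
For reference, a sketch of the standard Rocchio query update, q' = alpha*q + beta*mean(relevant) - gamma*mean(irrelevant). The function name rocchio_update, the default weights, the toy vectors, and the clipping of negative term weights to zero are assumptions for illustration, not something stated in the notes.

```python
def rocchio_update(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query vector toward relevant docs and away from irrelevant ones."""
    def mean(vectors):
        if not vectors:
            return [0.0] * len(query)
        return [sum(col) / len(vectors) for col in zip(*vectors)]

    rel, irr = mean(relevant), mean(irrelevant)
    new_query = [alpha * q + beta * r - gamma * i for q, r, i in zip(query, rel, irr)]
    # Assumption: negative term weights are clipped to zero
    return [max(0.0, w) for w in new_query]

# Hypothetical 3-term vectors: term 0 is the original query term,
# term 1 occurs in relevant docs, term 2 occurs in an irrelevant doc
query = [1, 0, 0]
relevant = [[1, 1, 0], [1, 1, 0]]
irrelevant = [[0, 0, 1]]
print(rocchio_update(query, relevant, irrelevant))
```

Raising beta pulls in more terms from the relevant documents, and raising gamma pushes harder against terms from the irrelevant ones. For pseudo relevance feedback, the same function could be called with the top k result vectors as relevant and an empty irrelevant list.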

10/25/11

In the Rocchio classification formula, changing beta changes the recall while changing gamma changes the precision.

-Vaibhav Jain

10/25/2011

What is the biggest problem in applying classification techniques in the web?

Getting labeled data is the problem. Getting people to do the labeling without letting them know is the hard part. Also, this could result in noisy or wrong data.

- Archana

10/25/2011

Euclidean distance to the centroid of a class can be misleading when classifying a new object, particularly if a class is in the middle of another class or the class is disjoint (take the example from class of coastal cities). In this case, for deciding which class to place an incoming object in, nearest neighbors may be a better method for choosing the correct class.

--James Cotter

10/25/2011

Most of the current search engines attempt to do Query Rewriting by User Profiling

--Shaunak Shah

10/25/2011

Clustering is unsupervised learning and classification is supervised learning.

- Elias

10/20/2011

Quality of clustering:

It can be calculated using two measures:

- intra-cluster tightness

- inter-cluster distance

In k-means we only consider intra-cluster tightness.

The rate of change of the dissimilarity measure is considered for choosing k.

-rajasekhar

10/20/2011

R trees clustering solves many of the issues with k-means:

1) no longer need to supply the number of clusters up front
2) it can consider multiple pairs of initial seeds
-James Cotter

10/20/2011

KMeans works by optimizing intracluster tightness. It can produce
suboptimal results if a poor seed is chosen, because KMeans can get
stuck in a local minimum. Another problem with KMeans is that you need
to define the number of clusters. One way to determine a suitable K is
to compute the change in intracluster dissimilarity as K increases and
find the point at which the change in dissimilarity begins to slow.
Another technique is hierarchical clustering, which can be done through
division or agglomeration of clusters at each level of clustering.
-Thomas Hayden
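
A rough sketch of the "watch the dissimilarity as K increases" idea, in plain Python. The points, the number of restarts, and the helper names (kmeans, best_cost) are assumptions for illustration. Running several seeds and keeping the best result is one simple way to reduce the poor-seed problem.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means; returns (centroids, total intra-cluster squared distance)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial seeds: k distinct data points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid; keep the old one if its cluster is empty
        centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    cost = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, cost

def best_cost(points, k, restarts=10):
    """Best of several seeds, since a poor seed can get stuck in a local minimum."""
    return min(kmeans(points, k, seed=s)[1] for s in range(restarts))

# Hypothetical data: three visually separate groups of three points
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
costs = {k: best_cost(points, k) for k in range(1, 6)}
for k in range(1, 6):
    print(k, round(costs[k], 2))
```

The printed dissimilarity should drop sharply up to the natural number of groups and flatten afterwards; the K where the drop slows is the candidate to pick.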

10/13/2011

Clustering is sometimes called unsupervised learning. Single link
(merge if any one near node is close enough) vs. complete link (merge
only if even the farthest nodes are close enough). Purity of cluster
problem: we could specify a number of clusters equal to the number of
elements, and this would give us perfect purity! Purity is only
meaningful if we enforce a number of clusters. Creating an optimal
clustering is not always simple due to the large number of possible
partitions; instead we create an intuitive clustering and then
iteratively improve it.
- Thomas Hayden

10/20/2011

Assuming clusters are spherical in vector space in K-Means leads to sensitivity to co-ordinate changes and weighting.

--

Avinash Bhashyam

10/20/2011

Dissimilarity can be reduced by increasing the cluster numbers.

Shubhendra


10/20/2011

The user needs snippets to determine the better cluster. We can make snippets from the most frequently occurring words: use word frequency within the cluster, but we still need an inverse cluster frequency value (to eliminate words that are super common in all clusters).

Cluster Snippet = top k words with highest tf*icf values from this cluster.

--
Nikhil Pratap
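
A small sketch of the tf*icf snippet idea from the note above. The note does not define icf precisely, so this assumes icf = log(number of clusters / number of clusters containing the word), by analogy with idf; cluster_snippets and the toy clusters are hypothetical.

```python
import math
from collections import Counter

def cluster_snippets(clusters, k=2):
    """clusters: dict of cluster name -> list of documents (each a list of words).
    Returns the top k words per cluster by tf*icf."""
    # tf: how often each word occurs within each cluster
    tf = {name: Counter(w for doc in docs for w in doc) for name, docs in clusters.items()}
    n = len(clusters)
    # cf: number of clusters in which each word appears at least once
    cf = Counter(w for counts in tf.values() for w in counts)
    snippets = {}
    for name, counts in tf.items():
        # Assumed definition: icf = log(n / cf), so words in every cluster score 0
        scores = {w: f * math.log(n / cf[w]) for w, f in counts.items()}
        # Sort by score (descending), breaking ties alphabetically
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        snippets[name] = ranked[:k]
    return snippets

# Hypothetical clusters for the ambiguous query "jaguar"
clusters = {
    "A": [["the", "jaguar", "car", "engine"], ["the", "car", "dealer"]],
    "B": [["the", "jaguar", "cat", "jungle"], ["the", "cat", "prey"]],
}
print(cluster_snippets(clusters))
```

Words such as "the" and "jaguar" appear in every cluster, so their icf is zero and they drop out of the snippets.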

10/20/11

Clustering on high-dimensional data like documents is tricky, as most document pairs have a similarity approaching 0. The cosine theta distance is not really a good measure, so use the costly LSI to reduce dimensions. The true distance is then represented in the reduced-dimension space. Manjara did this, but it's not practical yet.

10/20/2011

Bisecting K-means algorithm solves two issues of k-means

1. No need to know the value of K at the beginning

2. Solves the local minima problem by considering multiple pairs for identifying the initial cluster

-Bharath

10/20/2011

Although Agglomerative Hierarchical Clustering (AHC) can produce an ordering of objects, one of its major drawbacks is that if an object is placed in the wrong group early on, there is no provision for relocating it.

-----Abilash

10/20/2011

K-means clustering is disjoint and exhaustive, which leads to an outlier problem: every data point, even an outlier, is assigned to some cluster. One data point cannot be a part of two clusters.

-Archana

Sunday, October 23, 2011

10/20/2011

In k-means, each cluster is represented by its mean, and the goodness measure is the dissimilarity of the points to their cluster mean. One of the problems it faces is that dissimilarity falls as K increases; taken to the extreme, this would lead K to be the number of points/nodes in the entire problem. One solution, when the data is too large to cluster in full (for example, 3 trillion books), is to take a sample, run k-means on it with different K, and use the best. The other solution could be to penalize increases in the number of clusters.

--
Ivan Zhou

Graduate Student

Graduate Professional Student Association (GPSA) Assembly Member

School of Computing, Informatics and Decision Systems Engineering

Ira A. Fulton School of Engineering

Arizona State University

10/20/2011

K-Means Problems:

1. K - no. of clusters needs to be specified upfront
2. K-means is sensitive to the initial seeds picked.

Advantages of K-means:
Very fast, linear in all the relevant factors.

Ramya

Saturday, October 22, 2011

10/20/11

Reiterating the point raised in class.

If the points on a 2D plane are evenly distributed to form a circle, then bisecting (2-means divisive) hierarchical clustering will divide the points along a diameter of the circle. After iterating this step n times, we will have 2^n clusters of equal size. If I cut the hierarchy to get an odd number of clusters, then the resulting clusters will not be of equal size.

--bhaskar

Friday, October 21, 2011

10/20/2011

The internal measure of clustering depends upon

Intra cluster tightness.
Inter cluster separation.

Preethi

10/20/2011

The disadvantage of the k-means clustering algorithm is that the value of K must be known in advance, and the disadvantage of hierarchical clustering is that it consumes a lot of time (O(n^2)). Another algorithm, known as the Buckshot algorithm, is used as a hybrid of the two and addresses both problems.

Preethi

Thursday, October 20, 2011

10/20/2011

One way to address the problem of not knowing k in advance of running
k-means is to look for kinks in the tightness vs. K curve. However, it
might not be practical to run k-means for so many different k's on
large datasets. So one way to improve this is to take a random sample
of the dataset beforehand, running k-means with various sizes of k to
find the "best" k.

- Stephen Booher

10/20/2011

In k-means, the seeds you start with may not reach the global minimum of the goodness measure.

- Elias

10/13/11

Any clustering method has an internal bias to find a particular shape of cluster. K-Means is looking for spheres (based on its distance measure), so it will find spheres. Consequently, if you have a chain of data, k-means may not be the best clustering method. Fortunately, it is a pretty good methodology for text documents.

10/13/2011

An outlier in a neighborhood approach can be defined as a point that has fewer than d points within distance x, where d and x are pre-defined thresholds. Outliers are removed in a pre-processing step.

-----Abilash
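
The neighborhood definition above translates almost directly into code. In this sketch, find_outliers is a hypothetical name, a point is not counted as its own neighbor, and "within x" is taken to mean distance less than or equal to x; both are assumptions. It uses math.dist, available in Python 3.8 and later.

```python
import math

def find_outliers(points, x, d):
    """Return the points that have fewer than d other points within distance x."""
    outliers = []
    for i, p in enumerate(points):
        # Count neighbors of p, excluding p itself (assumption)
        neighbors = sum(
            1 for j, q in enumerate(points)
            if j != i and math.dist(p, q) <= x
        )
        if neighbors < d:
            outliers.append(p)
    return outliers

# Hypothetical data: a tight group of four points and one far-away point
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
outliers = find_outliers(points, x=2, d=2)
cleaned = [p for p in points if p not in outliers]  # pre-processing step
print(outliers)
print(cleaned)
```

This brute-force version compares every pair of points, so it is quadratic in the number of points.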

Wednesday, October 19, 2011

10/13/2011

Purity of a clustering is the sum of the pure sizes of its clusters / total number of elements in the clusters.
Pure size of a cluster is the size of the max (most frequent) class in that cluster.
Purity is a simple measure to evaluate a clustering.

-Stephen Booher
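
A short sketch of the purity computation as defined above. Each cluster is represented as a list of the true class labels of its members; the labels and the function name purity are made up for the example.

```python
from collections import Counter

def purity(clusters):
    """clusters: list of clusters, each a list of true class labels.
    Purity = sum of pure sizes / total number of elements."""
    total = sum(len(c) for c in clusters)
    # Pure size of a cluster = size of its most frequent class
    pure = sum(Counter(c).most_common(1)[0][1] for c in clusters if c)
    return pure / total

# Hypothetical clustering of 10 elements from classes x, o, d
clustering = [["x", "x", "o"], ["o", "o", "o", "x"], ["d", "d", "x"]]
print(purity(clustering))

# One cluster per element is trivially pure
singletons = [["x"], ["x"], ["o"]]
print(purity(singletons))
```

The second call shows why purity is not useful unless the number of clusters is fixed: putting every element in its own cluster always scores 1.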

Tuesday, October 18, 2011

10/13/2011

Regarding cluster purity, its computation won't work as well when the number of clusters increases: if the number of clusters increases until it reaches the number of elements, then every cluster has a purity of 1. A hard clustering is a complete member assignment, and a soft clustering is a partial assignment to a cluster.

--
Ivan Zhou
