Naive Bayes is one solution to the problem of classification: given some data (for example, emails) and some independence assumptions, we apply Bayes' Theorem to estimate the probability that an item belongs to a given class (e.g. spam or not spam).
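A minimal sketch of that idea in Python, assuming a tiny invented toy corpus (the documents, labels, and words below are made up purely for illustration):

import math
from collections import Counter, defaultdict

# Invented toy training data
train = [
    ("win money now", "spam"),
    ("free money offer", "spam"),
    ("meeting agenda attached", "ham"),
    ("lunch tomorrow", "ham"),
]

class_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
for text, label in train:
    word_counts[label].update(text.split())
vocab = {w for counts in word_counts.values() for w in counts}

def classify(text):
    """Return the class with the highest posterior P(class | words),
    using log probabilities and Laplace smoothing."""
    best, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / len(train))   # prior
        total = sum(word_counts[label].values())
        for w in text.split():                               # naive independence over words
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

print(classify("free money"))   # expected: spam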
Monday, October 31, 2011
Sunday, October 30, 2011
09/27/2011
Interesting point noticed while analyzing project part 2 results.
Pages are expected to be ranked using the following equation: W*PageRank + (1-W)*TF-IDF similarity, where PageRank measures the importance of a page against the entire document corpus (including irrelevant documents), and the TF-IDF similarity measures how similar a page is within the cluster formed by relevant documents. If W is not chosen carefully, this combination can rank some irrelevant pages above relevant ones.
--bhaskar
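A rough sketch of that weighted combination; the scores and the weight below are made-up numbers, not real project data:

def combined_score(pagerank, tfidf_sim, w=0.5):
    """Weighted mix of global importance (PageRank) and
    relevance to the query cluster (TF-IDF similarity)."""
    return w * pagerank + (1 - w) * tfidf_sim

# Made-up example: a popular but irrelevant page vs. a relevant niche page.
popular_irrelevant = combined_score(pagerank=0.9, tfidf_sim=0.1, w=0.8)
niche_relevant = combined_score(pagerank=0.1, tfidf_sim=0.9, w=0.8)
print(popular_irrelevant, niche_relevant)  # with w=0.8 the irrelevant page wins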
Thursday, October 27, 2011
10/27/2011
Neural nets consume a lot of training time, i.e. the time required to go from training data to learned expressions.
Preethi
10/27/2011
In parametric learners the model size is not proportional to the training data; the number of parameters does not depend upon the data. Most learners are parametric.
Non-parametric learners keep the entire training data around, e.g. k-NN.
Preethi
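A small k-NN sketch illustrating the non-parametric case: the "model" is nothing but the stored training set (the toy points below are invented):

import math
from collections import Counter

# The entire training data is kept around (invented toy points)
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.0), "B"), ((5.2, 4.9), "B")]

def knn_classify(x, k=3):
    """Label x by majority vote among the k nearest training points."""
    nearest = sorted(train, key=lambda item: math.dist(x, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_classify((4.8, 5.1)))  # expected: B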
10/25/2011
The general assumption while clustering is that clusters are spherical in shape, but in reality that is often not the case. In order to deal with irregular clusters, or clusters within clusters, the best bet is to perform the clustering using a nearest-neighbor algorithm.
----Abilash
10/25/2011
Getting feedback from users on the search results proves very helpful
when computing the next results, and it can be a way to personalize
them. A problem we sometimes face with clustering is labeling the clusters.
-Ivan Zhou
10/25/2011
A good learning algorithm should be able to handle the noise in the training data.
Relevance feedback rewrites the query such that the rewritten query will have higher precision and recall.
-rajshekar
10/25/11
Rocchio classification for text data uses existing data clusters (i.e. the training data) and then puts a new doc in its closest cluster (nearest centroid). But it does not change the centroid, as is done in k-means.
M.
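A quick nearest-centroid sketch in that spirit; the class vectors are invented toy term weights, standing in for real TF-IDF vectors:

import math

# Invented toy training vectors per class
classes = {
    "sports":   [(1.0, 0.1, 0.0), (0.9, 0.2, 0.1)],
    "politics": [(0.1, 1.0, 0.8), (0.0, 0.9, 1.0)],
}

def centroid(vectors):
    return tuple(sum(dim) / len(vectors) for dim in zip(*vectors))

centroids = {label: centroid(vs) for label, vs in classes.items()}

def rocchio_classify(doc):
    """Assign the new doc to the class with the nearest centroid.
    The centroids are NOT updated afterwards, unlike in k-means."""
    return min(centroids, key=lambda c: math.dist(doc, centroids[c]))

print(rocchio_classify((0.8, 0.2, 0.05)))  # expected: sports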
10/25/2011
When LSI is done on classification problems, we pick dimensions that maximally differentiate across classes,
while having minimal variance within any given class. This method is called LDA - Linear Discriminant Analysis.
-Sandeep Gautham
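If scikit-learn is available, a minimal LDA sketch could look like this; the random data is purely a stand-in for real document vectors and class labels:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Stand-in data: 100 "documents" in 20 dimensions with 3 class labels
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = rng.integers(0, 3, size=100)

# LDA picks directions that separate the classes while keeping
# within-class variance small; at most (n_classes - 1) components.
lda = LinearDiscriminantAnalysis(n_components=2)
X_reduced = lda.fit_transform(X, y)
print(X_reduced.shape)  # (100, 2)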
10/25/2011
Relevance feedback can be used by reconstructing the query, adding the knowledge learned from the user feedback to the initial query.
--
Avinash Bhashyam
10/25/11
Relevance feedback calculation is an active way to get relevant documents, treating the documents the user navigates to as new training examples, in contrast to passive calculation, where the training examples are all predefined at the start.
Andree
10/25/2011
Semi-supervised learning is a combination of both supervised and
unsupervised learning. Since most applications require at least some
input from the user on what the categories should be, but the machine
is expected to discover categories as well, they fall under this
category rather than one of the two extremes it is composed of.
~Kalin Jonas
Wednesday, October 26, 2011
10/25/2011
Using clustering techniques for classification is not enough because
it doesn't support non-spherical data clusters very well.
-Stephen Booher
10/25/2011
Relevance feedback is a process where the user identifies some relevant and irrelevant documents and the system creates an expanded query by extracting additional terms from the sample relevant and irrelevant documents.
-- Dinu
10/20/2011
The quality of a clustering is a function of intra-cluster tightness and inter-cluster distance. The aim is to make each cluster tight and to keep the clusters far away from each other.
In k-means we only consider intra-cluster tightness, because K is constant: if every cluster becomes tight, the clusters automatically become far from each other.
-- Dinu
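A rough sketch of those two measures on toy 2D points (the clusters below are invented):

import math

# Invented toy clusters of 2D points
clusters = [
    [(1.0, 1.0), (1.1, 0.9), (0.9, 1.2)],
    [(5.0, 5.0), (5.2, 4.8), (4.9, 5.1)],
]

def centroid(points):
    return tuple(sum(d) / len(points) for d in zip(*points))

def intra_tightness(points):
    """Average distance of cluster members to their centroid (smaller = tighter)."""
    c = centroid(points)
    return sum(math.dist(p, c) for p in points) / len(points)

def inter_distance(c1, c2):
    """Distance between two cluster centroids (larger = better separated)."""
    return math.dist(centroid(c1), centroid(c2))

print([round(intra_tightness(c), 3) for c in clusters])
print(round(inter_distance(*clusters), 3))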
Tuesday, October 25, 2011
10/25/2011
Classical learning can be broadly classified into 2 types:
1. Unsupervised learning - Clustering
2. Supervised learning - Classification
-Sandeep Gautham
10/25/2011
Pseudo relevance feedback assumes only the top k documents to be relevant.
In Rocchio feedback, changing beta changes the recall and changing gamma changes the precision.
Srividya
10/25/11
In the Rocchio classification formula, changing beta changes the recall, while changing gamma changes the precision.
-Vaibhav Jain
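A minimal sketch of the Rocchio query update, treating queries and documents as plain term-weight dicts; the vectors and the alpha/beta/gamma values below are illustrative, not prescribed:

from collections import defaultdict

def rocchio(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """q_new = alpha*q + beta*mean(relevant docs) - gamma*mean(irrelevant docs).
    Raising beta pulls in more terms from relevant docs (recall); raising gamma
    pushes the query away from irrelevant docs (precision)."""
    new_q = defaultdict(float)
    for term, w in query.items():
        new_q[term] += alpha * w
    for doc in relevant:
        for term, w in doc.items():
            new_q[term] += beta * w / len(relevant)
    for doc in irrelevant:
        for term, w in doc.items():
            new_q[term] -= gamma * w / len(irrelevant)
    # negative weights are usually clipped to zero
    return {t: w for t, w in new_q.items() if w > 0}

q = {"jaguar": 1.0}
rel = [{"jaguar": 0.8, "car": 0.6}]
irr = [{"jaguar": 0.7, "cat": 0.9}]
print(rocchio(q, rel, irr))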
10/25/2011
What is the biggest problem in applying classification techniques on the web?
Getting labeled data. Getting people to do the labeling without letting them know is the hard part. It can also end up introducing noise or wrong data.
- Archana
10/25/2011
Euclidean distance to the centroid of a class can be misleading when classifying a new object, particularly if a class sits in the middle of another class or the class is disjoint (take the example in class of coastal cities). In this case, for deciding in which class to place an incoming object, nearest neighbors may be a better method for choosing the correct class.
--James Cotter
10/25/2011
Most of the current search engines attempt to do Query Rewriting by User Profiling
--Shaunak Shah
10/20/2011
Quality of clustering:
It can be calculated using two measures:
-intra-cluster tightness
-inter-cluster distance
In k-means we only consider intra-cluster tightness.
The rate of change of the dissimilarity measure is considered for choosing k.
-rajasekhar
10/20/2011
R-trees clustering solves many of the issues with k-means:
1) no longer need to supply the number of clusters up front
2) it can consider multiple pairs of initial seeds
-James Cotter
10/20/2011
K-means works through intra-cluster tightness. It can produce suboptimal
results if a poor seed is chosen, because k-means gets stuck
in a local minimum. Another problem with k-means is that you need to define
the number of clusters. One way to determine a suitable K is to
compute the change in intra-cluster dissimilarity as K increases and
find the point where the change in dissimilarity begins to slow.
Another technique is hierarchical clustering, which can be done through
division or agglomeration of clusters at each level of clustering.
-Thomas Hayden
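A bare-bones k-means sketch showing the seed-sensitivity point; the points and seed are invented, and a bad seed choice can leave it in a poor local minimum:

import math
import random

def kmeans(points, k, iters=20, seed=None):
    """Plain k-means: assign each point to the nearest centroid, then
    recompute centroids; repeat. The result depends on the initial seeds."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[i].append(p)
        centroids = [
            tuple(sum(d) / len(c) for d in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

pts = [(1, 1), (1.2, 0.9), (0.8, 1.1), (5, 5), (5.1, 4.8), (4.9, 5.2)]
print(kmeans(pts, k=2, seed=0)[0])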
10/13/2011
Clustering is sometimes called unsupervised learning. Single link: if one
member is near the new point, it is allowed in ("a near node was allowed, so should I be").
Versus complete link: the new point must be close to every member ("I can't let
another in unless it is close to all my nodes"). Purity as a cluster measure has a problem: because
we could specify a number of clusters equal to the number of elements,
and this would give us perfect purity! It can only work if we
enforce a fixed number of clusters. Creating an optimal clustering is not
always simple due to the large number of possible clusterings; instead we
create an intuitive clustering and then iteratively improve it. - Thomas
Hayden
10/20/2011
Assuming clusters are spherical in vector space in K-Means leads to sensitivity to co-ordinate changes and weighting.
--
Avinash Bhashyam
10/20/2011
The user needs snippets to determine the better cluster. We can make snippets
using the most frequently occurring words: use word frequency in the cluster,
but we still need an inverse cluster frequency value (to eliminate
super-common words that occur in all clusters).
Cluster snippet = top k words with the highest tf*icf values from this cluster.
--
Nikhil Pratap
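A rough sketch of picking snippet words by tf*icf; the toy clusters of words below are invented:

import math
from collections import Counter

# Invented toy clusters, each a bag of words from its documents
clusters = {
    "c1": "apple banana apple fruit common".split(),
    "c2": "engine car car wheel common".split(),
    "c3": "senate vote law common".split(),
}

def snippet(cluster_id, k=3):
    """Top-k words by tf * icf, where icf down-weights words that
    appear in many clusters (like 'common' above)."""
    tf = Counter(clusters[cluster_id])
    n_clusters = len(clusters)
    scores = {}
    for w, f in tf.items():
        cf = sum(1 for words in clusters.values() if w in words)
        scores[w] = f * math.log(n_clusters / cf)
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(snippet("c2"))  # 'common' gets pushed down by its icf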
10/20/11
Clustering on high-dimensional data like documents is tricky, as most document
pairs have a similarity approaching 0. The cosine-theta distance is not really
a good measure there, so use the costly LSI to reduce dimensions. The true
distance is then represented in the reduced-dimension space.
Manjara did this, but it's not practical. Yet.
M.
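If scikit-learn is at hand, an LSI-style reduction might be sketched like this; the tiny corpus is made up:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "a cat and a dog",
    "stock markets fell sharply",
    "investors sold stocks",
]  # made-up corpus

X = TfidfVectorizer().fit_transform(docs)               # high-dimensional, sparse
X_lsi = TruncatedSVD(n_components=2).fit_transform(X)   # reduced (LSI-like) space

# Similarities are often more meaningful in the reduced space
print(cosine_similarity(X_lsi[:1], X_lsi[1:2]))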
10/20/2011
The bisecting K-means algorithm solves two issues of k-means:
1. No need to know the value of K at the beginning.
2. Mitigates the local-minima problem by trying multiple pairs of initial seeds when identifying each split.
-Bharath
10/20/2011
Although Agglomerative Hierarchical Clustering (AHC) can produce an ordering of objects, one of its major drawbacks is that if an object is placed in the wrong group early on, there is no provision for relocating it later.
-----Abilash
10/20/2011
K-means produces a disjoint and exhaustive clustering, which is why outliers are a problem: one data point cannot be a part of two clusters, and every point must be assigned to some cluster.
-Archana
Sunday, October 23, 2011
10/20/2011
In k-means, the dissimilarity of points from their cluster mean is the measure being minimized. One of the problems it faces is that as we increase K, dissimilarity falls; taken to the extreme, K could become the number of points/nodes in the entire problem. One solution is sampling instead of using everything (for example, with 3 trillion books, sample, try different K's, and use the best). Another solution is to penalize increasing the number of clusters.
--
Ivan Zhou
10/20/2011
K-Means Problems:
1. K, the number of clusters, needs to be specified upfront.
2. K-means is sensitive to the initial seeds picked.
Advantages of K-means:
Very fast; linear in all the relevant factors.
Ramya
Saturday, October 22, 2011
10/20/11
Reiterating the point raised in class.
If the points on a 2D plane are evenly distributed to form a circle, then bisecting (2-means divisive) hierarchical clustering will divide the points along a diameter of the circle. After iterating this step n times, we will have 2^n clusters of equal size. If I cut the hierarchy to get an odd number of clusters, then the resulting clusters will not be of equal size.
--bhaskar
Friday, October 21, 2011
10/20/2011
The internal measure of clustering depends upon
- Intra cluster tightness.
- Inter cluster separation.
Preethi
10/20/2011
The disadvantage of the K-means clustering algorithm is that the value of K must be known in advance, and the disadvantage of Hierarchical Clustering is that it consumes a lot of time (O(n^2)). So another algorithm, known as the Buckshot algorithm, is used as a hybrid of the two, addressing both problems.
Preethi
Thursday, October 20, 2011
10/20/2011
One way to address the problem of not knowing k in advance of running
k-means is to look for kinks in the tightness vs. K curve. However, it
might not be practical to run k-means for so many different k's on
large datasets. One way to improve this is to take a random sample
of the dataset beforehand, running k-means with various values of k to
find the "best" k.
- Stephen Booher
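A sketch of that idea, assuming scikit-learn's KMeans and invented blob data; the "kink" is found by eye from the printed tightness values:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Invented data: three blobs, so the kink should show up around k = 3
data = np.vstack([rng.normal(loc, 0.3, size=(200, 2)) for loc in (0, 5, 10)])

# Run k-means on a random sample only, to keep the sweep over k cheap
sample = data[rng.choice(len(data), size=100, replace=False)]
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(sample)
    print(k, round(km.inertia_, 1))   # inertia_ = intra-cluster tightness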
10/20/2011
In k-means, the seeds you start with may not reach the global minimum of the goodness measure.
- Elias
10/13/11
Any clustering method has an internal bias to find a particular shape of cluster.
K-means is looking for spheres (based on its distance measure), so it will find spheres.
Consequently, if you have a chain of data, k-means may not be the best clustering
method. Fortunately, it is a pretty good methodology for text documents.
M.
10/13/2011
An outlier in a neighborhood approach can be defined as a point that has fewer than d points within distance x, where d and x are pre-defined thresholds. Outliers are removed in a pre-processing step.
-----Abilash
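A small sketch of that definition on toy 2D points; the thresholds d and x and the points themselves are invented:

import math

points = [(1, 1), (1.1, 0.9), (0.9, 1.0), (1.0, 1.2), (9, 9)]  # (9, 9) is the outlier

def outliers(pts, d=2, x=1.0):
    """A point is an outlier if it has fewer than d neighbors within distance x."""
    flagged = []
    for p in pts:
        neighbors = sum(1 for q in pts if q is not p and math.dist(p, q) <= x)
        if neighbors < d:
            flagged.append(p)
    return flagged

out = outliers(points)
clean = [p for p in points if p not in out]   # pre-processing: drop the outliers
print(out, clean)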
Wednesday, October 19, 2011
10/13/2011
Purity of a clustering is the sum of pure sizes / total number of elements in the clusters.
The pure size of a cluster is the size of the largest class in that cluster.
Purity is a simple measure to evaluate a clustering.
-Stephen Booher
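A tiny sketch of that computation; the cluster-to-class assignments below are invented:

from collections import Counter

# Each inner list holds the true class labels of the items in one cluster (invented)
clusters = [
    ["spam", "spam", "spam", "ham"],
    ["ham", "ham", "spam"],
]

def purity(clusters):
    """Sum of each cluster's majority-class count, divided by total items."""
    pure_sizes = sum(Counter(c).most_common(1)[0][1] for c in clusters)
    total = sum(len(c) for c in clusters)
    return pure_sizes / total

print(purity(clusters))  # (3 + 2) / 7 ≈ 0.714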
Tuesday, October 18, 2011
10/13/2011
Regarding cluster purity, its computation won't work as well when the number of clusters increases, because if the number of clusters grows to reach the number of elements, all of them would have a purity of 1. Hard clustering is a complete membership assignment; soft clustering is a partial assignment to a cluster.
--
Ivan Zhou