Cluster Analysis

Clusters
  • Homogeneous within a cluster
  • Heterogeneous across clusters
  • Based on the field of interest
  • There is no objective function
  • No dependent variable
  • Popularly called subjective segmentation
  • Segments emerge on their own from the values of the input variables
  • This is called unsupervised learning
How is clustering different from decision tree techniques
  • A decision tree technique requires a clearly defined objective function
  • This is based on identifying independent and dependent variables
Why clustering
  • Find groups whose elements are similar to one another
  • Elements of different groups are dissimilar to each other
  • Devise a specific treatment for each group
Examples/Business Scenarios
  • In Marketing
    • Leading-edge shoppers - looking for new technology products
    • Big brand shoppers
    • Seasonal shoppers
    • Value conscious shoppers
  • Fashion consumer groups
    • Group driven by trends
    • Group interested in Fashion
    • Group purchasing when necessary
  • Cluster analysis as part of business strategy
    • Best Buy
      • Logical, manageable segments identified
        • Short-on-time mother
        • Happy-go-lucky youngster
Segmentation - If you are not thinking segments, you are not thinking. To think segments means you have to think about what drives customers, customer groups, and the choices that are or might be available to them

Hierarchical Clustering
  • Agglomerative Clustering Algorithm (a code sketch follows this section)
    • Compute the distance matrix between input data points
    • Let each data point be its own cluster
    • Repeat
      • Merge the two closest clusters
      • Update the distance matrix
    • Until only one cluster remains
    • The key operation is the computation of the distance between two clusters
      • Different definitions of distance lead to different algorithms
      • Euclidean distance (straight-line distance)
        • d = sqrt((x2-x1)^2 + (y2-y1)^2)
        • Properties
          • d(x,y) >= 0, and d(x,y) = 0 only when x = y
          • d(x,x) = 0
          • d(x,y) = d(y,x) (symmetry)
          • d(x,y) <= d(x,k) + d(k,y) (triangle inequality)
        • Intermediate state (distance between clusters)
          • Centroid method
            • Obtain the mean of each cluster (the centroid)
            • centroid = Σ xi / n
  • Where it can be applied
    • For N observations, the first step computes NC2 = N(N-1)/2 pairwise distances
    • The next step computes (N-1)C2 distances, and so on until only one cluster remains
    • Drawbacks
      • For large datasets, the approach becomes computationally infeasible
      • Usually this method is applied when the dataset has fewer than about 100 observations
      • For large datasets, non-hierarchical methods such as the K-Means algorithm are used
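A minimal sketch of the agglomerative procedure described above, using the centroid method in plain NumPy. The helper names and the toy dataset are illustrative assumptions, not from any particular library.

import numpy as np

def euclidean(a, b):
    # Straight-line distance: sqrt((x2-x1)^2 + (y2-y1)^2), for any dimension
    return np.sqrt(np.sum((a - b) ** 2))

def agglomerative_centroid(points, target_k=1):
    # Let each data point start as its own cluster (stored as index lists)
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > target_k:
        # Centroid of each cluster: the mean of its members (Σ xi / n)
        centroids = [points[c].mean(axis=0) for c in clusters]
        # Scan the pairwise distances between clusters and find the closest pair
        best_i, best_j, best_d = None, None, np.inf
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = euclidean(centroids[i], centroids[j])
                if d < best_d:
                    best_i, best_j, best_d = i, j, d
        # Merge the two closest clusters and repeat
        clusters[best_i] = clusters[best_i] + clusters[best_j]
        del clusters[best_j]
    return clusters

data = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9], [9.0, 1.0]])
print(agglomerative_centroid(data, target_k=2))

Each pass scans all pairs of current clusters, which is where the NC2, (N-1)C2, ... distance computations come from, and why the method is reserved for small datasets.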
Non-Hierarchical Clustering (K-Means Algorithm)
  • Partitioning method
    • Partition a database D of n objects into k clusters such that the sum of squared distances (from each object to its cluster centroid) is minimized
    • Decide k, the number of clusters needed, in advance
    • Given k, the k-means algorithm is implemented in 4 steps (sketched in code after this list)
      • Partition the objects into k non-empty subsets (randomly)
      • Compute seed points as the centroids (means) of the clusters of the current partition
      • Assign each object to the cluster with the nearest seed point
      • Go back to step 2; stop when the assignments no longer change
    • Strength
      • Efficiency
    • Weakness
      • Applicable only to data in a continuous n-dimensional space
      • Need to specify the number of clusters (k) in advance
      • Sensitive to noisy data and outliers
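A minimal sketch of the 4-step k-means loop described above, again in plain NumPy. The toy data, the random seed, and the helper name k_means are illustrative assumptions; a production implementation would also handle clusters that become empty during reassignment.

import numpy as np

def k_means(points, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: partition the objects into k non-empty subsets at random
    labels = rng.permutation(np.arange(len(points)) % k)
    centroids = None
    for _ in range(max_iter):
        # Step 2: compute seed points as the centroids (means) of the current partition
        centroids = np.array([points[labels == c].mean(axis=0) for c in range(k)])
        # Step 3: assign each object to the cluster with the nearest seed point
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: go back to step 2; stop when the assignments no longer change
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids

data = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9], [9.0, 1.0]])
labels, centroids = k_means(data, k=2)
print(labels, centroids, sep="\n")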
Linkage functions
  • Functions which decide how to combine two clusters (compared in the SciPy sketch after this list)
  • If the distance between two clusters is below a threshold, they are combined
  • Single Link distance
    • Distance between the two most similar objects (the minimum distance between any two points in the two clusters)
  • Complete Link Distance
    • Maximum distance between any object in C1 and any object in C2
  • Group Average Distance
    • Average distance over all pairs of objects, one in C1 and one in C2
  • Centroid distance
    • Distance between the centroids (means) of the two clusters
  • Ward Distance
    • The increase in total within-cluster sum of squares when the two clusters are merged (the merged cluster's sum of squares minus the sum of the two clusters' separate sums of squares)
    • Less susceptible to noise and outliers
    • Biased towards globular clusters
    • This is the hierarchical analogue of K-means
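A small sketch comparing the linkage functions listed above, using SciPy's hierarchical clustering routines (scipy.cluster.hierarchy.linkage and fcluster). The toy dataset and the choice of two clusters are illustrative assumptions.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

data = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9], [9.0, 1.0]])

for method in ["single", "complete", "average", "centroid", "ward"]:
    Z = linkage(data, method=method)                  # merge history under this linkage
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
    print(method, labels)

On well-separated data all five linkages usually agree; on noisy or elongated data the choice of linkage can change the resulting clusters noticeably.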
Scree Plot
  • Plot of the number of clusters against the within-cluster variance; the "elbow" in the curve suggests a suitable number of clusters (see the sketch below)
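A minimal sketch of a scree (elbow) plot, assuming scikit-learn and matplotlib are available: the within-cluster sum of squares (KMeans' inertia_) is computed for each candidate number of clusters and plotted. The toy dataset is an illustrative assumption.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

data = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9], [9.0, 1.0]])

ks = range(1, 5)
# Within-cluster sum of squares for each candidate k
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(data).inertia_ for k in ks]

plt.plot(list(ks), wcss, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Within-cluster sum of squares")
plt.title("Scree plot")
plt.show()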
