Cluster Analysis
Clusters
How is cluster analysis different from decision tree techniques?
- Homogeneous within groups
- Heterogeneous across groups
- Based on variables from the field of interest
- There is no objective function
- No dependent variable
- Popularly called subjective segmentation
- Segmentation develops on its own based on the values of the input variables
- This is called unsupervised learning
- A decision tree technique requires a clearly defined objective function
- It is based on identifying independent and dependent variables
- Find groups whose elements are similar to one another
- Elements of different groups are dissimilar to each other
- Generate treatments for each specific group
- In Marketing
- Leading-edge shoppers, looking for new technology products
- Big brand shoppers
- Seasonal shoppers
- Value conscious shoppers
- Fashion consumer groups
- Group driven by trends
- Group interested in fashion
- Group purchasing when necessary
- Cluster analysis as part of business strategy
- Best Buy
- Logical, manageable segments identified
- Short-on-time mothers
- Happy-go-lucky youngsters
Hierarchical Clustering
- Agglomerative Clustering Algorithm (a worked sketch follows at the end of this section)
- Compute the distance matrix between the input data points
- Let each data point be its own cluster
- Repeat
- Merge the two closest clusters
- Update the distance matrix
- Until only one cluster remains
- The key operation is the computation of the distance between two clusters
- Different definitions of distance lead to different algorithms
- Euclidean distance (straight-line distance)
- sqrt((x2 - x1)^2 + (y2 - y1)^2)
- Properties
- d(x,y) >= 0
- d(x,x) = 0
- d(x,y) = d(y,x)
- d(x,y) <= d(x,k) + d(k,y)
- Intermediate steps need a distance between clusters, not just between points
- Centroid method
- Obtain the mean for each cluster (the centroid)
- centroid = (sum of xi) / n
- Where it can be applied
- For N observations, step 1 does NC2 = N(N-1)/2 distance calculations
- Step 2 does (N-1)C2 calculations, and so on until only one cluster is left
- Drawbacks
- For large datasets, this becomes a computationally infeasible approach
- This method is usually applied when the dataset has fewer than about 100 observations
- For large datasets, non-hierarchical methods such as the K-Means algorithm are used
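Below is a minimal sketch of the agglomerative procedure above, using SciPy on a small made-up 2-D dataset. The data, the choice of the centroid method, and the cut into two clusters are all illustrative assumptions, not part of the notes. It first builds the pairwise Euclidean distance matrix, then lets `linkage` run the merge loop.

```python
# Agglomerative clustering sketch on a toy dataset (assumed for illustration).
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D data: two loose groups of points.
X = np.array([[1.0, 1.2], [1.5, 1.8], [1.2, 0.9],
              [8.0, 8.5], [8.3, 7.9], [7.7, 8.2]])

# Step 1: distance matrix between all pairs of points
# (N choose 2 Euclidean distances, i.e. sqrt((x2-x1)^2 + (y2-y1)^2)).
d = pdist(X, metric="euclidean")
print(squareform(d).round(2))          # full N x N matrix, for inspection

# Remaining steps: start with each point as its own cluster and repeatedly
# merge the two closest clusters until one remains; here the inter-cluster
# distance is the centroid method.
Z = linkage(X, method="centroid")

# Cut the tree to recover, say, 2 clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

For datasets much larger than a few hundred points, this pairwise-distance step is exactly where the cost noted above bites, which is why K-Means becomes the practical choice.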
Non-Hierarchical Clustering (K-Means Algorithm)
- Partitioning method
- Partition a database D of n objects into k clusters such that the sum of squared distances from each object to its cluster centroid is minimized
- Decide k, the number of clusters needed at the end
- Given k, the k-means algorithm is implemented in 4 steps (see the sketch at the end of this section)
- Step 1: Partition the objects into k non-empty subsets (randomly)
- Step 2: Compute seed points as the centroids (means) of the clusters of the current partitioning
- Step 3: Assign each object to the cluster with the nearest seed point
- Step 4: Go back to step 2; stop when the assignments no longer change
- Strengths
- Efficiency: scales to large datasets
- Weaknesses
- Applicable only to data in a continuous n-dimensional space
- Need to specify the number of clusters (k) in advance
- Sensitive to noisy data and outliers
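A from-scratch sketch of the four steps above, on made-up data with k = 2 (both are assumptions for illustration; in practice a library routine such as scikit-learn's `sklearn.cluster.KMeans` would be used, which adds smarter initialisation).

```python
# K-Means sketch following the four steps listed above.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),      # toy data: two blobs
               rng.normal(6, 1, (50, 2))])
k = 2

# Step 1: random initial partition of the objects into k subsets.
labels = rng.integers(0, k, size=len(X))

while True:
    # Step 2: seed points = centroids (means) of the current clusters.
    # (Empty-cluster handling is omitted for brevity.)
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    # Step 3: assign every object to the cluster with the nearest seed point.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    new_labels = dists.argmin(axis=1)
    # Step 4: go back to step 2; stop when the assignment no longer changes.
    if np.array_equal(new_labels, labels):
        break
    labels = new_labels

print(centroids.round(2))
```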
Linkage Functions
- Functions that decide how to combine two clusters (a comparison sketch follows this list)
- If the distance between two clusters is below a threshold, we combine them
- Single Link Distance
- Minimum distance between any point in C1 and any point in C2 (the distance between the two most similar objects)
- Complete Link Distance
- Maximum distance between any object in C1 and any object in C2
- Group Average Distance
- Average distance over all pairs with one object in C1 and one in C2
- Centroid Distance
- Distance between the centroids of the two clusters
- Ward Distance
- The increase in the within-cluster sum of squares caused by a merge: the merged cluster's within-cluster sum of squares minus the total within-cluster sum of squares of the two clusters taken separately
- Less susceptible to noise & outliers
- It is biased towards globular clusters
- This is the hierarchical analogue of K-Means
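A short sketch comparing the linkage functions listed above on the same made-up data (the dataset and the two-cluster cut are assumptions; the method names are SciPy's identifiers for single, complete, average, centroid and Ward linkage).

```python
# Compare how different linkage functions split the same toy data.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),
               rng.normal(5, 0.5, (20, 2))])

for method in ["single", "complete", "average", "centroid", "ward"]:
    Z = linkage(X, method=method)            # inter-cluster distance definition
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(method, np.bincount(labels)[1:])   # cluster sizes under each linkage
```

On well-separated globular blobs like these, all five methods agree; the differences show up with noisy, elongated, or outlier-laden data, as the notes above suggest.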
Scree Plot
- Plot the number of clusters against the within-cluster variance; choose k at the "elbow", where adding more clusters stops reducing the variance much (see the sketch below)
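A minimal scree-plot sketch (the three-blob data and the range of k are assumptions): for each candidate k, fit K-Means and record the within-cluster sum of squares, which scikit-learn exposes as `inertia_`, then look for the elbow in the curve.

```python
# Scree (elbow) plot: within-cluster sum of squares versus number of clusters.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.7, (40, 2)) for c in (0, 5, 10)])  # toy data

ks = range(1, 9)
wss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), wss, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Within-cluster sum of squares")
plt.title("Scree plot: look for the elbow")
plt.show()
```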