Cluster Analysis

Clusters
  • Homogeneous within a cluster
  • Heterogeneous across clusters
  • Based on the field of interest
  • There is no objective function
  • No dependent variable
  • Popularly called subjective segmentation
  • Segments emerge on their own from the values of the input variables
  • This is called unsupervised learning
How is clustering different from decision tree techniques
  • A decision tree technique requires a clearly defined objective function
  • This is based on identifying independent and dependent variables
Why clustering
  • Find groups whose elements are similar to one another
  • Elements of different groups are dissimilar to each other
  • Devise a specific treatment for each group
Examples/Business Scenarios
  • In Marketing
    • Leading-edge shoppers - looking for new technology products
    • Big brand shoppers
    • Seasonal shoppers
    • Value conscious shoppers
  • Fashion consumer groups
    • Group driven by trends
    • Group interested in Fashion
    • Group purchasing when necessary
  • Cluster analysis as part of business strategy
    • Best Buy
      • Logical, manageable segments identified
        • Short-on-time mother
        • Happy-go-lucky youngster
Segmentation - If you are not thinking segments, you are not thinking. To think segments means you have to think about what drives customers, customer groups, and the choices that are or might be available to them

Hierarchical Clustering
  • Agglomerative Clustering Algorithm (a code sketch follows this section)
    • Compute the distance matrix between input data points
    • Let each data point be its own cluster
    • Repeat
      • Merge the two closest clusters
      • Update the distance matrix
    • Until only one cluster remains
    • The key operation is the computation of the distance between two clusters
      • Different definitions of distance lead to different algorithms
      • Euclidean distance (straight-line distance)
        • d = sqrt((x2-x1)^2 + (y2-y1)^2)
        • Properties
          • d(x,y) >= 0, and d(x,y) = 0 only when x = y
          • d(x,x) = 0
          • d(x,y) = d(y,x) (symmetry)
          • d(x,y) <= d(x,k) + d(k,y) (triangle inequality)
        • Intermediate state (distance between clusters)
          • Centroid method
            • Obtain the mean of each cluster (the centroid)
            • centroid = Σ xi / n
  • Where it can be applied
    • For N observations, the first step computes NC2 = N(N-1)/2 pairwise distances
    • The next step computes (N-1)C2 distances, and so on until only one cluster remains
    • Drawbacks
      • For large datasets, the approach becomes computationally infeasible
      • Usually this method is applied when the dataset has fewer than about 100 observations
      • For large datasets, non-hierarchical methods such as the K-Means algorithm are used
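A minimal sketch of the agglomerative procedure described above, using the centroid method in plain NumPy. The helper names and the toy dataset are illustrative assumptions, not from any particular library.

import numpy as np

def euclidean(a, b):
    # Straight-line distance: sqrt((x2-x1)^2 + (y2-y1)^2), for any dimension
    return np.sqrt(np.sum((a - b) ** 2))

def agglomerative_centroid(points, target_k=1):
    # Let each data point start as its own cluster (stored as index lists)
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > target_k:
        # Centroid of each cluster: the mean of its members (Σ xi / n)
        centroids = [points[c].mean(axis=0) for c in clusters]
        # Scan the pairwise distances between clusters and find the closest pair
        best_i, best_j, best_d = None, None, np.inf
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = euclidean(centroids[i], centroids[j])
                if d < best_d:
                    best_i, best_j, best_d = i, j, d
        # Merge the two closest clusters and repeat
        clusters[best_i] = clusters[best_i] + clusters[best_j]
        del clusters[best_j]
    return clusters

data = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9], [9.0, 1.0]])
print(agglomerative_centroid(data, target_k=2))

Each pass scans all pairs of current clusters, which is where the NC2, (N-1)C2, ... distance computations come from, and why the method is reserved for small datasets.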
Non-Hierarchical Clustering (K-Means Algorithm)
  • Partitioning method
    • Partition a database D of n objects into k clusters such that the sum of squared distances (from each object to its cluster centroid) is minimized
    • Decide k, the number of clusters needed, in advance
    • Given k, the k-means algorithm is implemented in 4 steps (sketched in code after this list)
      • Partition the objects into k non-empty subsets (randomly)
      • Compute seed points as the centroids (means) of the clusters of the current partition
      • Assign each object to the cluster with the nearest seed point
      • Go back to step 2; stop when the assignments no longer change
    • Strength
      • Efficiency
    • Weakness
      • Applicable only to data in a continuous n-dimensional space
      • Need to specify the number of clusters (k) in advance
      • Sensitive to noisy data and outliers
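A minimal sketch of the 4-step k-means loop described above, again in plain NumPy. The toy data, the random seed, and the helper name k_means are illustrative assumptions; a production implementation would also handle clusters that become empty during reassignment.

import numpy as np

def k_means(points, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: partition the objects into k non-empty subsets at random
    labels = rng.permutation(np.arange(len(points)) % k)
    centroids = None
    for _ in range(max_iter):
        # Step 2: compute seed points as the centroids (means) of the current partition
        centroids = np.array([points[labels == c].mean(axis=0) for c in range(k)])
        # Step 3: assign each object to the cluster with the nearest seed point
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: go back to step 2; stop when the assignments no longer change
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids

data = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9], [9.0, 1.0]])
labels, centroids = k_means(data, k=2)
print(labels, centroids, sep="\n")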
Linkage functions
  • Functions which decide how to combine two clusters (compared in the SciPy sketch after this list)
  • If the distance between two clusters is below a threshold, they are combined
  • Single Link distance
    • Distance between the two most similar objects (the minimum distance between any two points in the two clusters)
  • Complete Link Distance
    • Maximum distance between any object in C1 and any object in C2
  • Group Average Distance
    • Average distance over all pairs of objects, one in C1 and one in C2
  • Centroid distance
    • Distance between the centroids (means) of the two clusters
  • Ward Distance
    • The increase in total within-cluster sum of squares when the two clusters are merged (the merged cluster's sum of squares minus the sum of the two clusters' separate sums of squares)
    • Less susceptible to noise and outliers
    • Biased towards globular clusters
    • This is the hierarchical analogue of K-means
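A small sketch comparing the linkage functions listed above, using SciPy's hierarchical clustering routines (scipy.cluster.hierarchy.linkage and fcluster). The toy dataset and the choice of two clusters are illustrative assumptions.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

data = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9], [9.0, 1.0]])

for method in ["single", "complete", "average", "centroid", "ward"]:
    Z = linkage(data, method=method)                  # merge history under this linkage
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
    print(method, labels)

On well-separated data all five linkages usually agree; on noisy or elongated data the choice of linkage can change the resulting clusters noticeably.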
Scree Plot
  • Plot of the number of clusters against the within-cluster variance; the "elbow" in the curve suggests a suitable number of clusters (see the sketch below)
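A minimal sketch of a scree (elbow) plot, assuming scikit-learn and matplotlib are available: the within-cluster sum of squares (KMeans' inertia_) is computed for each candidate number of clusters and plotted. The toy dataset is an illustrative assumption.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

data = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9], [9.0, 1.0]])

ks = range(1, 5)
# Within-cluster sum of squares for each candidate k
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(data).inertia_ for k in ks]

plt.plot(list(ks), wcss, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Within-cluster sum of squares")
plt.title("Scree plot")
plt.show()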
