Machine Learning Hierarchical Clustering

Hierarchical Clustering groups data by building a tree of clusters called a dendrogram. Unlike K-Means, it does not require specifying the number of clusters upfront. The dendrogram shows how data points merge (or split) at every level of similarity, letting the user choose the number of clusters after viewing the full picture.

Two Types of Hierarchical Clustering

┌──────────────────────┬──────────────────────────────────────────┐
│ Type                 │ How it Works                             │
├──────────────────────┼──────────────────────────────────────────┤
│ Agglomerative        │ Bottom-up: Start with each point as its  │
│ (most common)        │ own cluster. Merge the two closest       │
│                      │ clusters at each step. Stop when one     │
│                      │ cluster remains.                         │
│ Divisive             │ Top-down: Start with all data in one     │
│                      │ cluster. Recursively split the least     │
│                      │ cohesive cluster until each point is its │
│                      │ own cluster.                             │
└──────────────────────┴──────────────────────────────────────────┘

Agglomerative Clustering Step by Step

6 Data Points: A, B, C, D, E, F

Step 1: Every point is its own cluster
  {A}  {B}  {C}  {D}  {E}  {F}

Step 2: Find closest pair → Merge them
  A and B are closest → Merge: {A,B}  {C}  {D}  {E}  {F}

Step 3: Find next closest pair → Merge
  D and E are closest → Merge: {A,B}  {C}  {D,E}  {F}

Step 4: Continue merging
  {A,B} and C are the closest pair → {A,B,C}  {D,E}  {F}

Step 5: Merge again
  {D,E} and F merge → {D,E,F}

Step 6: Final merge
  {A,B,C} and {D,E,F} → {A,B,C,D,E,F}

Result: Dendrogram (tree diagram)
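The merge loop above can be sketched in a few lines of plain Python. The 1-D coordinates below are hypothetical, chosen only so that single-linkage distances reproduce the merge order of the walkthrough (A+B, then D+E, then C joins, then F, then all):

```python
from itertools import combinations

# Hypothetical 1-D coordinates chosen so the merge order matches
# the walkthrough above (A,B first; then D,E; then C; then F; then all).
points = {"A": 0.0, "B": 1.0, "C": 3.0, "D": 10.0, "E": 11.5, "F": 14.5}

def single_link(c1, c2):
    """Single linkage: distance between the two closest points."""
    return min(abs(points[p] - points[q]) for p in c1 for q in c2)

# Start with each point as its own cluster (agglomerative = bottom-up).
clusters = [frozenset([name]) for name in points]
merges = []
while len(clusters) > 1:
    # Find the closest pair of clusters and merge them.
    c1, c2 = min(combinations(clusters, 2),
                 key=lambda pair: single_link(*pair))
    clusters.remove(c1)
    clusters.remove(c2)
    clusters.append(c1 | c2)
    merges.append(sorted(c1 | c2))

for step, merged in enumerate(merges, start=2):
    print(f"Step {step}: merged into {{{','.join(merged)}}}")
```

Recording each merge and the distance at which it happened is exactly the information a dendrogram draws.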

The Dendrogram: Reading the Tree

Dendrogram for 6 Points:

Height
(dissimilarity)
     │
  10 │         ┌─────────────────────────┐
     │         │                         │
   7 │    ┌────┴────┐              ┌─────┴─────┐
     │    │         │              │           │
   4 │  ┌─┴─┐      [C]          ┌──┴──┐       [F]
     │  │   │                   │     │
   2 │ [A] [B]                 [D]   [E]
     │
     └─────────────────────────────────────────►
        A   B       C           D     E        F

Reading:
  Height 2: A and B merge (very similar)
  Height 4: D and E merge (very similar)
  Height 7: {A,B} and C merge | {D,E} and F merge
  Height 10: All 6 points in one cluster

Choosing K clusters:
  Cut between heights 7 and 10 → 2 clusters: {A,B,C}, {D,E,F}
  Cut between heights 4 and 7  → 4 clusters: {A,B}, {C}, {D,E}, {F}
  Cut between heights 2 and 4  → 5 clusters: {A,B}, {C}, {D}, {E}, {F}
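Cutting the tree is a one-liner in SciPy. A minimal sketch, using hypothetical 1-D coordinates for A..F (so the merge heights are illustrative, not the 2/4/7/10 of the drawing): `linkage` builds the tree and `fcluster` cuts it at a chosen distance threshold.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical 1-D data standing in for the six points A..F.
X = np.array([[0.0], [1.0], [3.0], [10.0], [11.5], [14.5]])

# Build the merge tree (single linkage here; see the next section).
Z = linkage(X, method="single")

# "Cutting" the dendrogram at a distance threshold yields flat clusters.
labels_2 = fcluster(Z, t=5.0, criterion="distance")  # cut above all but the final merge
labels_5 = fcluster(Z, t=1.2, criterion="distance")  # cut just above the first merge
print(labels_2)  # two clusters: {A,B,C} vs {D,E,F}
print(labels_5)  # five clusters: only A and B share a label
```

`scipy.cluster.hierarchy.dendrogram(Z)` draws the tree itself, which is how the plot in this section would be produced in practice.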

Linkage Criteria: How to Measure Cluster Distance

When two clusters exist, how is the distance between them measured?

┌───────────────────┬──────────────────────────────────────────────┐
│ Linkage Type      │ Distance Definition                          │
├───────────────────┼──────────────────────────────────────────────┤
│ Single Linkage    │ Distance between the two CLOSEST points      │
│                   │ from each cluster (minimum)                  │
│ Complete Linkage  │ Distance between the two FARTHEST points     │
│                   │ from each cluster (maximum)                  │
│ Average Linkage   │ Average distance between all pairs of points │
│                   │ across the two clusters                      │
│ Ward Linkage      │ Minimizes total within-cluster variance      │
│ (most popular)    │ after merge — produces compact clusters      │
└───────────────────┴──────────────────────────────────────────────┘
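The first three linkage rules are just min, max, and mean over the same pairwise-distance table; Ward scores a merge by how much it would inflate within-cluster variance. A small worked sketch with two hypothetical 1-D clusters:

```python
import numpy as np

# Two hypothetical 1-D clusters to keep the arithmetic easy to follow.
c1 = np.array([1.0, 2.0])
c2 = np.array([5.0, 9.0])

# Pairwise distances between every point in c1 and every point in c2.
pairwise = np.abs(c1[:, None] - c2[None, :])  # [[4, 8], [3, 7]]

single   = pairwise.min()    # closest pair:    |2 - 5| = 3
complete = pairwise.max()    # farthest pair:   |1 - 9| = 8
average  = pairwise.mean()   # mean of 4 pairs: (4 + 8 + 3 + 7) / 4 = 5.5

# Ward instead scores a merge by the increase in within-cluster variance:
# d_ward^2 = n1*n2 / (n1 + n2) * ||mean(c1) - mean(c2)||^2
ward_sq = (len(c1) * len(c2)) / (len(c1) + len(c2)) \
          * (c1.mean() - c2.mean()) ** 2  # 1 * (1.5 - 7.0)^2 = 30.25

print(single, complete, average, ward_sq)
```

Single linkage can "chain" through narrow bridges of points; complete and Ward resist chaining and tend to produce compact, similarly sized clusters, which is why Ward is the common default.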

K-Means vs Hierarchical Clustering

┌──────────────────────────┬───────────────┬─────────────────────┐
│ Feature                  │ K-Means       │ Hierarchical        │
├──────────────────────────┼───────────────┼─────────────────────┤
│ K required upfront?      │ Yes           │ No                  │
│ Output                   │ Flat clusters │ Tree (dendrogram)   │
│ Speed                    │ Fast          │ Slow on large data  │
│ Scalability              │ Large datasets│ Small-medium only   │
│ Cluster shape            │ Spherical     │ More flexible       │
│ Reproducible?            │ No (random)   │ Yes (deterministic) │
│ Handles outliers         │ Poorly        │ Poorly              │
│ Interpretability         │ Moderate      │ High (visual tree)  │
└──────────────────────────┴───────────────┴─────────────────────┘

When to Use Each

Use K-Means When:
  ✓ Dataset is large (thousands to millions of records)
  ✓ Approximate clusters are acceptable
  ✓ Speed matters more than perfect cluster shapes

Use Hierarchical Clustering When:
  ✓ Dataset is small (under 10,000 records)
  ✓ K is unknown and exploration is needed
  ✓ Full cluster structure (dendrogram) adds value
  ✓ Reproducibility is important
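When the checklist points to hierarchical clustering, scikit-learn's `AgglomerativeClustering` is the usual entry point. A minimal sketch on a small synthetic dataset (the two well-separated blobs are an assumption made for the demo):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical small dataset: two well-separated groups of 2-D points.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(20, 2)),  # group 1
    rng.normal(loc=(5, 5), scale=0.3, size=(20, 2)),  # group 2
])

# Ward linkage; deterministic, so no random_state is needed (unlike KMeans).
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(X)
print(labels[:5], labels[-5:])
```

Note that `n_clusters` can be left as `None` with a `distance_threshold` instead, which mirrors cutting the dendrogram at a chosen height rather than fixing K upfront.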
