DS Unsupervised Learning and Clustering
Unsupervised learning works without labelled data. The algorithm explores raw data and discovers hidden structure, groups, or patterns on its own — without anyone telling it what to look for. Clustering is the most common unsupervised task: it groups similar data points together based on their features alone.
What Is Clustering
Clustering groups data points so that points within the same group are more similar to each other than to points in other groups. A "cluster" is a naturally occurring group in the data — and the algorithm finds these groups automatically.
Real-World Clustering Applications
| Industry | Clustering Use Case |
|---|---|
| Retail | Group customers by purchase behaviour for targeted marketing |
| Healthcare | Group patients with similar symptoms for personalised treatment |
| Finance | Detect unusual transaction clusters for fraud detection |
| Technology | Group similar documents, news articles, or web pages |
| Biology | Cluster genes with similar expression patterns |
K-Means Clustering
K-Means is the most widely used clustering algorithm. It partitions data into K groups (clusters) by iteratively assigning each point to the nearest cluster centre (centroid) and updating the centroids based on the mean of assigned points.
How K-Means Works – Step by Step
Step 1: Choose K (number of clusters)
K = 3
Step 2: Place K centroids randomly
★ ★ ★ (random starting positions)
Step 3: Assign each point to its nearest centroid
Points near ★₁ → Cluster 1
Points near ★₂ → Cluster 2
Points near ★₃ → Cluster 3
Step 4: Recalculate centroids as the mean of each cluster
New ★ = average position of all points in that cluster
Step 5: Repeat Steps 3 and 4 until centroids stop moving
(convergence)
Step 6: Final clusters are the stable assignments
Diagram – K-Means Iteration
[Diagram (three panels): "Iteration 1" shows randomly placed centroids ★ scattered among three groups of points (●, ○, □); "Iteration 2" shows the centroids pulled towards those groups; "Converged" shows each final centroid ✦ sitting at the centre of its group.]
Centroids move closer to the true centre of each group each iteration.
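The update loop in Steps 3–5 can be written in a few lines of NumPy. The sketch below is for illustration only, assuming a 2-D array of points and ignoring edge cases such as empty clusters; the function name kmeans_sketch is ours, not a library routine.

import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=42):
    """Minimal K-Means: assign points to the nearest centroid, then move the centroids."""
    rng = np.random.default_rng(seed)
    # Step 2: pick K random data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: distance from every point to every centroid, keep the nearest
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centroids no longer move (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

In practice scikit-learn's KMeans does all of this (plus smarter initialisation). The example below applies it to synthetic customer data, scaling the features first.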
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs
# Generate synthetic customer data
np.random.seed(42)
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)
df = pd.DataFrame(X, columns=["AnnualSpend", "PurchaseFrequency"])
# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)
# Train K-Means with K=4
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
kmeans.fit(X_scaled)
# Assign cluster labels
df["Cluster"] = kmeans.labels_
print("Cluster counts:")
print(df["Cluster"].value_counts().sort_index())
print("\nCluster Centres (original scale):")
centres = scaler.inverse_transform(kmeans.cluster_centers_)
for i, c in enumerate(centres):
print(f" Cluster {i}: Spend={c[0]:.1f}, Frequency={c[1]:.1f}")
Choosing the Right K – The Elbow Method
The Elbow Method runs K-Means for multiple values of K and plots the inertia (total within-cluster sum of squared distances). The optimal K sits at the "elbow" of the curve — where adding more clusters stops significantly reducing inertia.
inertia = []
K_range = range(1, 11)
for k in K_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X_scaled)
    inertia.append(km.inertia_)
plt.figure(figsize=(8, 4))
plt.plot(K_range, inertia, "bo-", markersize=8)
plt.title("Elbow Method – Optimal Number of Clusters")
plt.xlabel("Number of Clusters (K)")
plt.ylabel("Inertia (Within-cluster Sum of Squares)")
plt.xticks(K_range)
plt.tight_layout()
plt.savefig("elbow_method.png")
plt.show()
Diagram – Reading the Elbow Curve
Inertia
  │\
  │ \
  │  \
  │   \
  │    ╲
  │     ╰────────────────────  (flattens out)
  └─────┬────────────────────→ K
        ↑
  Elbow point = Best K
Before the elbow: each additional cluster reduces inertia a lot
After the elbow: adding clusters gives diminishing returns
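One way to see this numerically is to look at how much the inertia falls with each extra cluster (a quick sketch reusing the inertia list and K_range from the elbow code above; the percentages are illustrative, not a formal rule):

# Percentage drop in inertia when moving from K to K+1
drops = -np.diff(inertia) / np.array(inertia[:-1]) * 100
for k, d in zip(K_range[:-1], drops):
    print(f"K={k} -> K={k + 1}: inertia falls by {d:.1f}%")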
Silhouette Score – Evaluating Cluster Quality
The Silhouette Score measures how well each point fits its assigned cluster versus its nearest neighbouring cluster. It ranges from -1 to 1. A score close to 1 means clusters are well-separated and tight. A score near 0 means clusters overlap. A negative score means a point is likely in the wrong cluster.
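For a single point i the score is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance from i to the other points in its own cluster and b(i) is the mean distance from i to the points in the nearest other cluster; silhouette_score below reports the average of s(i) over all points.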
from sklearn.metrics import silhouette_score
sil_scores = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    lbl = km.fit_predict(X_scaled)
    sil = silhouette_score(X_scaled, lbl)
    sil_scores.append(sil)
    print(f"K={k}: Silhouette Score = {sil:.4f}")
# Silhouette is only defined for K >= 2, so skip K=1 when indexing K_range
best_k = K_range[1:][np.argmax(sil_scores)]
print(f"\nBest K by Silhouette Score: {best_k}")
Hierarchical Clustering
Hierarchical Clustering builds a tree of clusters called a dendrogram. Agglomerative (bottom-up) hierarchical clustering starts with every point as its own cluster and merges the two closest clusters repeatedly until only one cluster remains. The dendrogram shows every possible level of clustering simultaneously — the right number of clusters comes from cutting the tree at the right height.
Diagram – Agglomerative Clustering
Start: Each point is its own cluster
A B C D E
Step 1: Merge closest pair (C and D)
A B [C-D] E
Step 2: Merge next closest (A and B)
[A-B] [C-D] E
Step 3: Merge [C-D] and E
[A-B] [C-D-E]
Step 4: Merge all into one
[A-B-C-D-E]
Dendrogram view:
      ┌─────────────────────┐
  ┌───┴───┐           ┌─────┴─────┐
  │       │       ┌───┴───┐       │
 (A)     (B)     (C)     (D)     (E)
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
# Use a small sample for the dendrogram
sample = X_scaled[:50]
# Compute linkage matrix
linked = linkage(sample, method="ward")
# Plot dendrogram
plt.figure(figsize=(12, 5))
dendrogram(linked, truncate_mode="lastp", p=15, leaf_rotation=45)
plt.title("Hierarchical Clustering Dendrogram")
plt.xlabel("Cluster")
plt.ylabel("Distance")
plt.tight_layout()
plt.savefig("dendrogram.png")
plt.show()
# Apply clustering with chosen number of clusters
agg = AgglomerativeClustering(n_clusters=4, linkage="ward")
df["HierCluster"] = agg.fit_predict(X_scaled)
print("Hierarchical Cluster counts:")
print(df["HierCluster"].value_counts().sort_index())
DBSCAN – Density-Based Clustering
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups points that are closely packed together and marks points in low-density areas as outliers (noise). Unlike K-Means, DBSCAN does not require specifying K in advance and handles clusters of any shape — including non-circular clusters.
Diagram – DBSCAN Concepts
ε = neighbourhood radius
MinPts = minimum points to form a cluster
● ● ●
● ● ● ← Dense region → Core Points → Cluster 1
● ● ●
○ ← Border Point (within ε of core, but not dense itself)
✕ ← Noise Point (isolated, not in any cluster)
● ● ●
● ● ← Dense region → Cluster 2
● ●
DBSCAN discovers both clusters without being told K=2
DBSCAN labels ✕ as noise (-1) automatically
from sklearn.cluster import DBSCAN
# Generate data with noise
np.random.seed(0)
from sklearn.datasets import make_moons
X_moon, _ = make_moons(n_samples=200, noise=0.1, random_state=42)
# Apply DBSCAN
dbscan = DBSCAN(eps=0.3, min_samples=5)
labels = dbscan.fit_predict(X_moon)
# Results
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = (labels == -1).sum()
print(f"Clusters found : {n_clusters}")
print(f"Noise points : {n_noise}")
print(f"Cluster labels : {set(labels)}")
Comparing Clustering Algorithms
| Algorithm | Needs K? | Cluster Shape | Handles Noise | Best For |
|---|---|---|---|---|
| K-Means | Yes | Roughly spherical (convex) | No | Large datasets, well-separated spherical clusters |
| Hierarchical | No (choose from dendrogram) | Any shape (depends on linkage) | No | Small datasets, visualising cluster hierarchy |
| DBSCAN | No | Any shape | Yes (marks as -1) | Geographical data, outlier detection |
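As a quick sanity check of the table, the sketch below reruns all three algorithms on two-moons data and compares their labels with the true moon membership using the adjusted Rand index (1.0 = perfect recovery). K-Means tends to score poorly here because it prefers compact clusters, while DBSCAN can follow the crescent shapes; exact numbers depend on the noise level and eps.

from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X_m, y_m = make_moons(n_samples=200, noise=0.1, random_state=42)
labels_by_algo = {
    "K-Means": KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_m),
    "Hierarchical (ward)": AgglomerativeClustering(n_clusters=2).fit_predict(X_m),
    "DBSCAN": DBSCAN(eps=0.3, min_samples=5).fit_predict(X_m),
}
for name, lbl in labels_by_algo.items():
    print(f"{name}: ARI = {adjusted_rand_score(y_m, lbl):.2f}")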
Practical Example – Customer Segmentation
# Real-world style: segment customers for a retail business
np.random.seed(99)
customers = pd.DataFrame({
"Annual_Income": np.random.normal(50000, 20000, 200).round(0),
"Spend_Score": np.random.randint(1, 100, 200),
"Purchase_Freq": np.random.randint(1, 52, 200)
})
# Scale
scaler_c = StandardScaler()
X_cust = scaler_c.fit_transform(customers)
# K-Means with K=5
km5 = KMeans(n_clusters=5, n_init=10, random_state=42)
customers["Segment"] = km5.fit_predict(X_cust)
# Profile each segment
profile = customers.groupby("Segment").mean().round(0)
print("Customer Segment Profiles:")
print(profile)
Example Output – Segment Profiles:
         Annual_Income  Spend_Score  Purchase_Freq
Segment
0              28312.0         73.0           38.0   ← Low income, high spenders
1              72105.0         82.0           45.0   ← High income, high spenders
2              49820.0         48.0           26.0   ← Mid income, average behaviour
3              32418.0         25.0           10.0   ← Low income, low engagement
4              68234.0         20.0           12.0   ← High income, rarely shop
Summary
- Clustering groups similar data points together without needing any labels
- K-Means partitions data into K clusters by iteratively moving centroids to the mean of each cluster
- The Elbow Method and Silhouette Score help choose the optimal value of K
- Hierarchical Clustering builds a dendrogram that shows all possible clusterings at once
- DBSCAN finds clusters of any shape and automatically labels outliers as noise
- Customer segmentation is one of the most common real-world applications of clustering
- Always scale features before clustering — algorithms that use distance are sensitive to feature scale
