DS Unsupervised Learning and Clustering
Unsupervised learning works without labelled data. The algorithm explores raw data and discovers hidden structure, groups, or patterns on its own — without anyone telling it what to look for. Clustering is the most common unsupervised task: it groups similar data points together based on their features alone.
What Is Clustering
Clustering groups data points so that points within the same group are more similar to each other than to points in other groups. A "cluster" is a naturally occurring group in the data — and the algorithm finds these groups automatically.
Real-World Clustering Applications
| Industry | Clustering Use Case |
|---|---|
| Retail | Group customers by purchase behaviour for targeted marketing |
| Healthcare | Group patients with similar symptoms for personalised treatment |
| Finance | Detect unusual transaction clusters for fraud detection |
| Technology | Group similar documents, news articles, or web pages |
| Biology | Cluster genes with similar expression patterns |
K-Means Clustering
K-Means is the most widely used clustering algorithm. It partitions data into K groups (clusters) by iteratively assigning each point to the nearest cluster centre (centroid) and updating the centroids based on the mean of assigned points.
How K-Means Works – Step by Step
Step 1: Choose K (number of clusters)
K = 3
Step 2: Place K centroids randomly
★ ★ ★ (random starting positions)
Step 3: Assign each point to its nearest centroid
Points near ★₁ → Cluster 1
Points near ★₂ → Cluster 2
Points near ★₃ → Cluster 3
Step 4: Recalculate centroids as the mean of each cluster
New ★ = average position of all points in that cluster
Step 5: Repeat Steps 3 and 4 until centroids stop moving
(convergence)
Step 6: Final clusters are the stable assignments
Diagram – K-Means Iteration
[Diagram (three panels): "Iteration 1" shows randomly placed centroids ★ scattered among three groups of points (●, ○, □); "Iteration 2" shows the centroids pulled towards those groups; "Converged" shows each final centroid ✦ sitting at the centre of its group.]
Centroids move closer to the true centre of each group each iteration.
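The update loop in Steps 3–5 can be written in a few lines of NumPy. The sketch below is for illustration only, assuming a 2-D array of points and ignoring edge cases such as empty clusters; the function name kmeans_sketch is ours, not a library routine.

import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=42):
    """Minimal K-Means: assign points to the nearest centroid, then move the centroids."""
    rng = np.random.default_rng(seed)
    # Step 2: pick K random data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: distance from every point to every centroid, keep the nearest
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centroids no longer move (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

In practice scikit-learn's KMeans does all of this (plus smarter initialisation). The example below applies it to synthetic customer data, scaling the features first.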
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs
# Generate synthetic customer data
np.random.seed(42)
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)
df = pd.DataFrame(X, columns=["AnnualSpend", "PurchaseFrequency"])
# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)
# Train K-Means with K=4
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
kmeans.fit(X_scaled)
# Assign cluster labels
df["Cluster"] = kmeans.labels_
print("Cluster counts:")
print(df["Cluster"].value_counts().sort_index())
print("\nCluster Centres (original scale):")
centres = scaler.inverse_transform(kmeans.cluster_centers_)
for i, c in enumerate(centres):
print(f" Cluster {i}: Spend={c[0]:.1f}, Frequency={c[1]:.1f}")
Choosing the Right K – The Elbow Method
The Elbow Method runs K-Means for multiple values of K and plots the inertia (total within-cluster sum of squared distances). The optimal K sits at the "elbow" of the curve — where adding more clusters stops significantly reducing inertia.
inertia = []
K_range = range(1, 11)
for k in K_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X_scaled)
    inertia.append(km.inertia_)
plt.figure(figsize=(8, 4))
plt.plot(K_range, inertia, "bo-", markersize=8)
plt.title("Elbow Method – Optimal Number of Clusters")
plt.xlabel("Number of Clusters (K)")
plt.ylabel("Inertia (Within-cluster Sum of Squares)")
plt.xticks(K_range)
plt.tight_layout()
plt.savefig("elbow_method.png")
plt.show()
Diagram – Reading the Elbow Curve
Inertia
  │\
  │ \
  │  \
  │   \
  │    ╲
  │     ╰────────────────────  (flattens out)
  └─────┬────────────────────→ K
        ↑
  Elbow point = Best K
Before the elbow: each additional cluster reduces inertia a lot
After the elbow: adding clusters gives diminishing returns
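One way to see this numerically is to look at how much the inertia falls with each extra cluster (a quick sketch reusing the inertia list and K_range from the elbow code above; the percentages are illustrative, not a formal rule):

# Percentage drop in inertia when moving from K to K+1
drops = -np.diff(inertia) / np.array(inertia[:-1]) * 100
for k, d in zip(K_range[:-1], drops):
    print(f"K={k} -> K={k + 1}: inertia falls by {d:.1f}%")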
Silhouette Score – Evaluating Cluster Quality
The Silhouette Score measures how well each point fits its assigned cluster versus its nearest neighbouring cluster. It ranges from -1 to 1. A score close to 1 means clusters are well-separated and tight. A score near 0 means clusters overlap. A negative score means a point is likely in the wrong cluster.
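For a single point i the score is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance from i to the other points in its own cluster and b(i) is the mean distance from i to the points in the nearest other cluster; silhouette_score below reports the average of s(i) over all points.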
from sklearn.metrics import silhouette_score
sil_scores = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    lbl = km.fit_predict(X_scaled)
    sil = silhouette_score(X_scaled, lbl)
    sil_scores.append(sil)
    print(f"K={k}: Silhouette Score = {sil:.4f}")
# Silhouette is only defined for K >= 2, so skip K=1 when indexing K_range
best_k = K_range[1:][np.argmax(sil_scores)]
print(f"\nBest K by Silhouette Score: {best_k}")
Hierarchical Clustering
Hierarchical Clustering builds a tree of clusters called a dendrogram. Agglomerative (bottom-up) hierarchical clustering starts with every point as its own cluster and merges the two closest clusters repeatedly until only one cluster remains. The dendrogram shows every possible level of clustering simultaneously — the right number of clusters comes from cutting the tree at the right height.
Diagram – Agglomerative Clustering
Start: Each point is its own cluster
A B C D E
Step 1: Merge closest pair (C and D)
A B [C-D] E
Step 2: Merge next closest (A and B)
[A-B] [C-D] E
Step 3: Merge [C-D] and E
[A-B] [C-D-E]
Step 4: Merge all into one
[A-B-C-D-E]
Dendrogram view:
      ┌─────────────────────┐
  ┌───┴───┐           ┌─────┴─────┐
  │       │       ┌───┴───┐       │
 (A)     (B)     (C)     (D)     (E)
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
# Use a small sample for the dendrogram
sample = X_scaled[:50]
# Compute linkage matrix
linked = linkage(sample, method="ward")
# Plot dendrogram
plt.figure(figsize=(12, 5))
dendrogram(linked, truncate_mode="lastp", p=15, leaf_rotation=45)
plt.title("Hierarchical Clustering Dendrogram")
plt.xlabel("Cluster")
plt.ylabel("Distance")
plt.tight_layout()
plt.savefig("dendrogram.png")
plt.show()
# Apply clustering with chosen number of clusters
agg = AgglomerativeClustering(n_clusters=4, linkage="ward")
df["HierCluster"] = agg.fit_predict(X_scaled)
print("Hierarchical Cluster counts:")
print(df["HierCluster"].value_counts().sort_index())
DBSCAN – Density-Based Clustering
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups points that are closely packed together and marks points in low-density areas as outliers (noise). Unlike K-Means, DBSCAN does not require specifying K in advance and handles clusters of any shape — including non-circular clusters.
Diagram – DBSCAN Concepts
ε = neighbourhood radius
MinPts = minimum points to form a cluster
● ● ●
● ● ● ← Dense region → Core Points → Cluster 1
● ● ●
○ ← Border Point (within ε of core, but not dense itself)
✕ ← Noise Point (isolated, not in any cluster)
● ● ●
● ● ← Dense region → Cluster 2
● ●
DBSCAN discovers both clusters without being told K=2
DBSCAN labels ✕ as noise (-1) automatically
from sklearn.cluster import DBSCAN
# Generate data with noise
np.random.seed(0)
from sklearn.datasets import make_moons
X_moon, _ = make_moons(n_samples=200, noise=0.1, random_state=42)
# Apply DBSCAN
dbscan = DBSCAN(eps=0.3, min_samples=5)
labels = dbscan.fit_predict(X_moon)
# Results
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = (labels == -1).sum()
print(f"Clusters found : {n_clusters}")
print(f"Noise points : {n_noise}")
print(f"Cluster labels : {set(labels)}")
Comparing Clustering Algorithms
| Algorithm | Needs K? | Cluster Shape | Handles Noise | Best For |
|---|---|---|---|---|
| K-Means | Yes | Roughly spherical (convex) | No | Large datasets, well-separated spherical clusters |
| Hierarchical | No (choose from dendrogram) | Any shape (depends on linkage) | No | Small datasets, visualising cluster hierarchy |
| DBSCAN | No | Any shape | Yes (marks as -1) | Geographical data, outlier detection |
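As a quick sanity check of the table, the sketch below reruns all three algorithms on two-moons data and compares their labels with the true moon membership using the adjusted Rand index (1.0 = perfect recovery). K-Means tends to score poorly here because it prefers compact clusters, while DBSCAN can follow the crescent shapes; exact numbers depend on the noise level and eps.

from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X_m, y_m = make_moons(n_samples=200, noise=0.1, random_state=42)
labels_by_algo = {
    "K-Means": KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_m),
    "Hierarchical (ward)": AgglomerativeClustering(n_clusters=2).fit_predict(X_m),
    "DBSCAN": DBSCAN(eps=0.3, min_samples=5).fit_predict(X_m),
}
for name, lbl in labels_by_algo.items():
    print(f"{name}: ARI = {adjusted_rand_score(y_m, lbl):.2f}")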
Practical Example – Customer Segmentation
# Real-world style: segment customers for a retail business
np.random.seed(99)
customers = pd.DataFrame({
"Annual_Income": np.random.normal(50000, 20000, 200).round(0),
"Spend_Score": np.random.randint(1, 100, 200),
"Purchase_Freq": np.random.randint(1, 52, 200)
})
# Scale
scaler_c = StandardScaler()
X_cust = scaler_c.fit_transform(customers)
# K-Means with K=5
km5 = KMeans(n_clusters=5, n_init=10, random_state=42)
customers["Segment"] = km5.fit_predict(X_cust)
# Profile each segment
profile = customers.groupby("Segment").mean().round(0)
print("Customer Segment Profiles:")
print(profile)
Example Output – Segment Profiles:
         Annual_Income  Spend_Score  Purchase_Freq
Segment
0              28312.0         73.0           38.0   ← Low income, high spenders
1              72105.0         82.0           45.0   ← High income, high spenders
2              49820.0         48.0           26.0   ← Mid income, average behaviour
3              32418.0         25.0           10.0   ← Low income, low engagement
4              68234.0         20.0           12.0   ← High income, rarely shop
Summary
- Clustering groups similar data points together without needing any labels
- K-Means partitions data into K clusters by iteratively moving centroids to the mean of each cluster
- The Elbow Method and Silhouette Score help choose the optimal value of K
- Hierarchical Clustering builds a dendrogram that shows all possible clusterings at once
- DBSCAN finds clusters of any shape and automatically labels outliers as noise
- Customer segmentation is one of the most common real-world applications of clustering
- Always scale features before clustering — algorithms that use distance are sensitive to feature scale
