DS Dimensionality Reduction with PCA
Real datasets often contain dozens or hundreds of features. Many of these features are redundant or correlated — they carry overlapping information. Dimensionality Reduction compresses the dataset into fewer, more informative features while retaining most of the original information. Principal Component Analysis (PCA) is the most widely used dimensionality reduction technique in data science.
What Is the Curse of Dimensionality
As the number of features in a dataset grows, the data becomes sparser and harder for algorithms to learn from. This phenomenon is called the Curse of Dimensionality. A simple example: with 1 feature and 10 data points, the space is reasonably covered. With 10 features, the same 10 points cover only a tiny fraction of the possible space — most machine learning algorithms perform worse as a result.
Diagram – Curse of Dimensionality
1D (1 feature):
[● ● ● ● ● ●]              Reasonably covered with a few points

2D (2 features):
+─────────────────────+
│ ●        ●          │
│       ●             │
│             ●       │
│   ●           ●     │
+─────────────────────+    Starting to spread out

3D and beyond:
Points become extremely sparse. Distance between points increases.
Machine learning struggles.

PCA compresses many dimensions → fewer dimensions while keeping most information.
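A quick numerical sketch of this sparsity effect (illustrative only — the points here are random values in a unit hypercube, not a real dataset): the average distance between the same number of points grows as features are added, so every point ends up far from its neighbours.

import numpy as np

# Illustrative sketch: average pairwise distance between 100 random points
# in a unit hypercube, as the number of features (dimensions) grows.
rng = np.random.default_rng(42)
for n_features in [1, 2, 10, 100]:
    points = rng.random((100, n_features))
    diffs = points[:, None, :] - points[None, :, :]        # all pairwise differences
    dists = np.sqrt((diffs ** 2).sum(axis=-1))             # Euclidean distances
    mean_dist = dists[np.triu_indices(100, k=1)].mean()    # ignore the zero diagonal
    print(f"{n_features:3d} features: mean pairwise distance ≈ {mean_dist:.2f}")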
Why Use PCA
- Speed – Fewer features means faster model training and prediction
- Visualisation – Compresses high-dimensional data to 2D or 3D for plotting
- Noise removal – Drops low-variance components that contain mostly noise
- Reduced overfitting – Having fewer features reduces the chance of a model memorising noise
- Removes multicollinearity – PCA components are guaranteed to be uncorrelated with each other (see the quick check after this list)
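The last point is easy to verify. A quick check on made-up data (the values below are purely illustrative) shows that the correlation matrix of the PCA-transformed features is essentially the identity matrix:

import numpy as np
from sklearn.decomposition import PCA

# Quick check on synthetic data: PCA output columns are mutually uncorrelated.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 4))
X_demo[:, 1] += X_demo[:, 0]                    # make two features correlated
components = PCA().fit_transform(X_demo)
print(np.round(np.corrcoef(components, rowvar=False), 3))   # ≈ identity matrix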
How PCA Works – Intuition
PCA finds new axes (called Principal Components) in the direction of maximum variance in the data. The first principal component points in the direction that contains the most variance. The second principal component is perpendicular to the first and captures the next most variance. Each subsequent component captures progressively less variance.
Diagram – PCA Concept
Original data in 2D:
      y
      │          ● ●
      │       ● ●
      │    ● ●
      │ ● ●
      └──────────── x

PCA finds the directions of maximum spread:

      y
      │        ╱   PC1 (most variance — diagonal direction)
      │     ╱
      │  ╱────────── PC2 (perpendicular, less variance)
      └──────────── x
Project data onto PC1 and PC2:
New coordinates: (PC1 value, PC2 value) for each point
If PC1 alone captures 95% of variance:
→ Drop PC2 and work in 1D instead of 2D
→ Minimal information loss
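A tiny numerical version of the picture above, using made-up 2D data where y is roughly equal to x (illustrative values only): PC1 should soak up nearly all of the variance, so dropping PC2 costs very little.

import numpy as np
from sklearn.decomposition import PCA

# Illustrative sketch: strongly correlated 2D data collapses almost entirely onto PC1.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
data_2d = np.column_stack([x, x + rng.normal(scale=0.2, size=200)])  # y ≈ x + small noise

pca = PCA(n_components=2).fit(data_2d)
print(pca.explained_variance_ratio_)   # first value close to 1.0 — PC1 dominates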
PCA Step by Step
PCA Algorithm:

Step 1: Standardise the data (mean = 0, std = 1)
Step 2: Compute the covariance matrix (measures feature correlations)
Step 3: Find eigenvectors and eigenvalues (directions and magnitudes of maximum variance)
Step 4: Sort by eigenvalue (PC1 has the largest eigenvalue, PC2 the second largest, etc.)
Step 5: Project the data onto the top N principal components (choose N = desired dimensions)
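The same five steps can be written out directly in NumPy. The sketch below is illustrative (the function name and arguments are our own, not a library API), but it mirrors the algorithm box above step for step:

import numpy as np

def pca_from_scratch(X, n_keep):
    # Step 1: standardise the data (mean 0, std 1 per feature)
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: covariance matrix of the standardised features
    cov = np.cov(X_std, rowvar=False)
    # Step 3: eigenvectors (directions) and eigenvalues (variance magnitudes)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Step 4: sort components from largest to smallest eigenvalue
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    # Step 5: project the data onto the top n_keep principal components
    return X_std @ eigenvectors[:, :n_keep]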
PCA in Python with Scikit-learn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer
# Load a high-dimensional dataset (30 features)
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
print("Original shape:", X.shape) # (569, 30)
# Step 1: Standardise
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Step 2: Apply PCA – keep ALL components first to inspect variance
pca_full = PCA()
pca_full.fit(X_scaled)
# Explained variance ratio for each component
explained = pca_full.explained_variance_ratio_
cumulative = np.cumsum(explained)
print("\nVariance explained by each component:")
for i, (ev, cv) in enumerate(zip(explained[:10], cumulative[:10])):
    print(f" PC{i+1:2d}: {ev*100:5.2f}% (Cumulative: {cv*100:.2f}%)")
Output:
Variance explained by each component:
 PC 1: 44.27% (Cumulative: 44.27%)
 PC 2: 18.97% (Cumulative: 63.24%)
 PC 3:  9.39% (Cumulative: 72.63%)
 PC 4:  6.60% (Cumulative: 79.23%)
 PC 5:  5.50% (Cumulative: 84.73%)
 PC 6:  4.02% (Cumulative: 88.75%)
 PC 7:  2.25% (Cumulative: 91.00%)
 ...
 PC10:  0.92% (Cumulative: 95.49%)
Scree Plot – How Many Components to Keep
plt.figure(figsize=(10, 4))
# Subplot 1: Explained variance per component
plt.subplot(1, 2, 1)
plt.bar(range(1, 11), explained[:10] * 100, color="steelblue", edgecolor="black")
plt.title("Variance by Component (Scree Plot)")
plt.xlabel("Principal Component")
plt.ylabel("Variance Explained (%)")
plt.xticks(range(1, 11))
# Subplot 2: Cumulative variance
plt.subplot(1, 2, 2)
plt.plot(range(1, 16), cumulative[:15] * 100, "bo-", markersize=7)
plt.axhline(y=95, color="red", linestyle="--", label="95% threshold")
plt.title("Cumulative Variance Explained")
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Variance (%)")
plt.legend()
plt.tight_layout()
plt.savefig("pca_scree.png")
plt.show()
Diagram – Reading the Scree Plot
Cumulative Variance:
100% │  ─ ─ ─ ─ ─ ─ ─ ─ ─ ────────────────
 95% │  ─ ─ ─ ─ ─ ─ ─ ───   ← Red threshold
     │            ╱
 80% │         ╱╯
     │      ╱╯
 60% │   ╱╯
     │ ╱
     └──────────────────────────
        1  2  3  4  5  6  ...  30
              Components
At PC=10, cumulative variance ≈ 95%.
→ Reduce 30 features to 10 with minimal information loss.
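The same "how many components for 95%?" question can be answered without reading the plot, straight from the cumulative array computed earlier:

# Number of components needed to reach 95% cumulative variance
n_95 = int(np.argmax(cumulative >= 0.95)) + 1
print(f"Components needed for 95% variance: {n_95}")   # 10 for this dataset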
Apply PCA with Selected Components
# Reduce to 10 components (captures 95% of variance)
pca_10 = PCA(n_components=10)
X_pca = pca_10.fit_transform(X_scaled)
print("Reduced shape:", X_pca.shape) # (569, 10)
print(f"Total variance retained: {pca_10.explained_variance_ratio_.sum()*100:.2f}%")
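Scikit-learn can also choose the number of components for you: passing a float between 0 and 1 as n_components keeps just enough components to reach that fraction of explained variance.

# Let PCA pick the number of components that reaches 95% variance
pca_auto = PCA(n_components=0.95)
X_auto = pca_auto.fit_transform(X_scaled)
print("Components chosen automatically:", pca_auto.n_components_)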
Visualise Data in 2D with PCA
PCA's most popular use is reducing high-dimensional data to 2 dimensions for visualisation. The 2D scatter plot reveals whether natural groups exist in the data.
# Reduce to 2 components for visualisation
pca_2d = PCA(n_components=2)
X_2d = pca_2d.fit_transform(X_scaled)
# Plot coloured by true label
plt.figure(figsize=(8, 6))
colors = ["tomato", "steelblue"]
labels = ["Malignant", "Benign"]
for cls, color, label in zip([0, 1], colors, labels):
    mask = y == cls
    plt.scatter(X_2d[mask, 0], X_2d[mask, 1],
                c=color, label=label, alpha=0.6, edgecolors="black", s=50)
plt.title("PCA – Breast Cancer Data in 2D")
plt.xlabel(f"PC1 ({pca_2d.explained_variance_ratio_[0]*100:.1f}% variance)")
plt.ylabel(f"PC2 ({pca_2d.explained_variance_ratio_[1]*100:.1f}% variance)")
plt.legend()
plt.tight_layout()
plt.savefig("pca_2d_scatter.png")
plt.show()
PCA as a Preprocessing Step for Machine Learning
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
# Model WITHOUT PCA (all 30 features)
lr_full = LogisticRegression(max_iter=1000, random_state=42)
lr_full.fit(X_train, y_train)
acc_full = accuracy_score(y_test, lr_full.predict(X_test))
# Model WITH PCA (10 components – 95% variance)
pipeline = Pipeline([
("pca", PCA(n_components=10)),
("clf", LogisticRegression(max_iter=1000, random_state=42))
])
pipeline.fit(X_train, y_train)
acc_pca = accuracy_score(y_test, pipeline.predict(X_test))
print(f"Accuracy WITHOUT PCA (30 features): {acc_full:.4f}")
print(f"Accuracy WITH PCA (10 components): {acc_pca:.4f}")
print(f"Feature reduction: 30 → 10 features ({(1-10/30)*100:.0f}% fewer)")
Output:
Accuracy WITHOUT PCA (30 features): 0.9737
Accuracy WITH PCA (10 components): 0.9649
Feature reduction: 30 → 10 features (67% fewer)

→ Only a 0.88 percentage-point drop in accuracy for a 67% reduction in features.
→ The model trains much faster with far fewer features.
Understanding Principal Components
# Which original features contribute most to PC1?
pc1_loadings = pd.Series(
pca_2d.components_[0],
index=data.feature_names
).abs().sort_values(ascending=False)
print("Features most responsible for PC1:")
print(pc1_loadings.head(8))
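The flip side of loadings is reconstruction: inverse_transform maps the 10-component representation back to the 30 standardised features, which gives a rough sense of how much information the compression discards (a quick sketch reusing pca_10 and X_pca from above):

# Reconstruct the 30 standardised features from the 10 principal components
X_reconstructed = pca_10.inverse_transform(X_pca)
reconstruction_error = np.mean((X_scaled - X_reconstructed) ** 2)
print(f"Mean squared reconstruction error: {reconstruction_error:.4f}")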
PCA Limitations and When Not to Use It
| Limitation | Reason |
|---|---|
| PCA components are harder to interpret | Each component is a combination of all original features — not a single recognisable variable |
| PCA captures linear relationships only | For non-linear relationships, use t-SNE or UMAP instead |
| PCA requires feature scaling | Always standardise before applying PCA — it is sensitive to scale differences (demonstrated after this table) |
| Information loss is unavoidable | Keeping fewer components always discards some variance from the original data |
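The scaling requirement is easy to demonstrate with the breast-cancer data loaded earlier: fit PCA on the raw (unscaled) features and the few features with the largest raw variances dominate PC1 almost entirely.

# Quick demonstration of the scaling limitation
pca_raw = PCA(n_components=2).fit(X)          # unscaled features
pca_std = PCA(n_components=2).fit(X_scaled)   # standardised features
print(f"PC1 variance share, unscaled:     {pca_raw.explained_variance_ratio_[0]:.2%}")
print(f"PC1 variance share, standardised: {pca_std.explained_variance_ratio_[0]:.2%}")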
t-SNE – Non-Linear Dimensionality Reduction for Visualisation
t-SNE (t-Distributed Stochastic Neighbour Embedding) reduces data to 2D or 3D for visualisation while preserving local structure. It handles non-linear relationships better than PCA — but is only suitable for visualisation, not as a preprocessing step for models.
from sklearn.manifold import TSNE
# Apply t-SNE to 2D
tsne = TSNE(n_components=2, perplexity=30, random_state=42)  # runs 1000 iterations by default
X_tsne = tsne.fit_transform(X_scaled)
# Visualise
plt.figure(figsize=(8, 6))
for cls, color, label in zip([0, 1], ["tomato", "steelblue"], ["Malignant", "Benign"]):
    mask = y == cls
    plt.scatter(X_tsne[mask, 0], X_tsne[mask, 1],
                c=color, label=label, alpha=0.7, edgecolors="black", s=50)
plt.title("t-SNE – Breast Cancer Data in 2D")
plt.xlabel("t-SNE Component 1")
plt.ylabel("t-SNE Component 2")
plt.legend()
plt.tight_layout()
plt.savefig("tsne_plot.png")
plt.show()
PCA vs t-SNE Comparison
| Aspect | PCA | t-SNE |
|---|---|---|
| Type | Linear transformation | Non-linear transformation |
| Speed | Fast | Slow (not suitable for large datasets) |
| Use for ML preprocessing? | Yes | No — visualisation only |
| Interpretable components? | Partially (loadings available) | No |
| Preserves | Global variance | Local neighbourhood structure |
Summary
- High-dimensional data suffers from the Curse of Dimensionality — models perform worse as features grow
- PCA compresses many features into fewer Principal Components while retaining most of the variance
- The Scree Plot shows how many components are needed to explain 90% or 95% of variance
- Always standardise features before applying PCA — scale differences distort the components
- PCA works as a preprocessing step to speed up training and reduce overfitting
- PCA components are linear combinations of all original features — they lose individual interpretability
- t-SNE captures non-linear structure and often separates clusters more clearly than PCA in 2D plots, but is not suitable for model preprocessing
