ML Dimensionality Reduction and PCA
Dimensionality reduction is the process of reducing the number of features (columns) in a dataset while keeping as much useful information as possible. As the number of features grows, models train more slowly, overfit more easily, and the data becomes impossible to visualize. Reducing dimensions addresses all of these problems.
The Curse of Dimensionality
Imagine trying to find a lost coin:
- 1D (a line): Searching 1 meter is easy
- 2D (a floor): Searching 1 m² is manageable
- 3D (a room): Searching 1 m³ takes time
- 100D (??): Data becomes impossibly sparse

In high-dimensional space:
- Data points become far apart from each other
- Distance measures lose meaning
- Models need exponentially more data to learn patterns
- Computation becomes very slow

This is called the "Curse of Dimensionality." Dimensionality reduction is the cure.
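The claim that "distance measures lose meaning" can be demonstrated directly. The NumPy sketch below (synthetic uniform data, illustrative only) measures the contrast between the nearest and farthest neighbour of a query point: as dimensions grow, that contrast collapses, so "near" and "far" become almost indistinguishable.

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_contrast(n_dims, n_points=200):
    """Relative gap (max - min) / min over distances from one query point."""
    points = rng.random((n_points, n_dims))   # uniform points in the unit cube
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# Contrast shrinks dramatically as dimensionality grows
for d in [1, 10, 100, 1000]:
    print(f"{d:>4} dims: contrast = {distance_contrast(d):.2f}")
```

In low dimensions the nearest neighbour is many times closer than the farthest; in 1000 dimensions all points sit at nearly the same distance.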
Why Reduce Dimensions?
┌──────────────────────────────┬────────────────────────────────────┐
│ Problem with High Dimensions │ How Reduction Helps                │
├──────────────────────────────┼────────────────────────────────────┤
│ Slow model training          │ Fewer features = faster training   │
│ Overfitting                  │ Fewer features = less noise        │
│ Cannot visualize             │ Reduce to 2D or 3D for plots       │
│ Redundant/correlated features│ Remove duplicated information      │
│ Poor model accuracy          │ Remove noise-only features         │
└──────────────────────────────┴────────────────────────────────────┘
Types of Dimensionality Reduction
Two Main Approaches:

1. Feature Selection: Keep the most useful original features, discard the rest. No new features are created — just a subset is chosen.
   Example: From 50 features, select the top 10 by importance.

2. Feature Extraction (Transformation): Create brand-new features that are combinations of the originals. The new features capture the most information in fewer dimensions.
   Example: PCA — the most popular technique.
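The two approaches can be contrasted in a few lines of scikit-learn. This is a minimal sketch using the Iris dataset (chosen only for illustration): SelectKBest keeps 2 of the 4 original columns, while PCA builds 2 entirely new components from all 4.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)            # 150 samples, 4 original features

# Feature selection: keep the 2 original columns with the highest F-score
X_selected = SelectKBest(f_classif, k=2).fit_transform(X, y)

# Feature extraction: create 2 brand-new components from all 4 columns
X_extracted = PCA(n_components=2).fit_transform(X)

print(X_selected.shape, X_extracted.shape)   # both reduced to 2 columns
```

Both results have the same shape, but the selected columns are still interpretable measurements, while the extracted components are mathematical mixtures.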
Principal Component Analysis (PCA)
PCA is the most widely used dimensionality reduction technique. It transforms the original features into a set of new, uncorrelated features called Principal Components. These components are ranked by how much information (variance) each one captures. The first principal component captures the most variance, the second captures the next most, and so on.
Intuition: Finding the Best Viewing Angle
Imagine a 3D object (like a building):
- From straight above → you see its floor plan (a 2D shadow)
- From the side → you see a different 2D shape

PCA finds the angle (direction) from which the shadow captures the MOST variation in the data.
- That direction = Principal Component 1 (PC1)
- The next best direction (perpendicular to PC1) = PC2
- And so on...
PCA Example: 2D to 1D
Original Data (2 features: Height, Weight):

  Weight
    │
    │              ●●●
    │          ●●●●
    │       ●●●●
    │   ●●●●
    │ ●●
    └────────────────────────────► Height

Observation: Height and Weight are highly correlated. They move together
along a diagonal direction. PCA finds this diagonal direction = PC1.

After PCA (project onto PC1):

  ●─────────────────────────────► PC1

One number now represents most of the information that Height and Weight
both carried together. We went from 2 features to 1, losing very little.
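The 2D-to-1D example above can be reproduced numerically. This sketch generates synthetic correlated height/weight data (the numbers are illustrative assumptions), finds PC1 from the covariance matrix, and projects each sample onto it:

```python
import numpy as np

rng = np.random.default_rng(42)
height = rng.normal(170, 10, 500)                    # cm
weight = 0.9 * height - 80 + rng.normal(0, 3, 500)   # kg, strongly correlated
X = np.column_stack([height, weight])

Xc = X - X.mean(axis=0)                 # center the data
cov = np.cov(Xc, rowvar=False)          # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
pc1 = eigvecs[:, -1]                    # direction of maximum variance

projected = Xc @ pc1                    # one number per sample
retained = eigvals[-1] / eigvals.sum()
print(f"variance retained by PC1: {retained:.1%}")
```

With data this correlated, a single component keeps well over 90% of the total variance, exactly the "2 features to 1, losing very little" claim.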
PCA Step by Step
Step 1: Standardize all features (mean = 0, std = 1).
        PCA is sensitive to scale — always standardize first.

Step 2: Compute the Covariance Matrix.
        Shows how much every pair of features varies together.
        High covariance = features tend to rise and fall together.

Step 3: Compute Eigenvectors and Eigenvalues.
        Eigenvectors = directions of the principal components
        Eigenvalues  = amount of variance each component captures

Step 4: Sort components by Eigenvalue (highest first).
        PC1 = direction of maximum variance
        PC2 = next direction (perpendicular to PC1)
        PC3 = next (perpendicular to PC1 and PC2)
        ...

Step 5: Choose how many components to keep.

Step 6: Project the original data onto the chosen components.
        New dataset = original data × selected eigenvectors
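The six steps above map directly onto a short NumPy implementation. This is a from-scratch sketch on a toy random matrix (with one deliberately redundant column), not a production implementation:

```python
import numpy as np

def pca_from_scratch(X, n_components):
    # Step 1: standardize (mean 0, std 1 per feature)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: covariance matrix (features x features)
    C = np.cov(Z, rowvar=False)
    # Step 3: eigenvectors (directions) and eigenvalues (variances)
    eigvals, eigvecs = np.linalg.eigh(C)
    # Step 4: sort by eigenvalue, highest first
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Step 5: keep the leading components
    W = eigvecs[:, :n_components]
    # Step 6: project — new data = standardized data x selected eigenvectors
    return Z @ W, eigvals

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=100)   # redundant feature

X_reduced, eigvals = pca_from_scratch(X, n_components=2)
print(X_reduced.shape)   # (100, 2)
```

In practice you would use `sklearn.decomposition.PCA`, which also handles numerical edge cases; the steps are the same.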
Explained Variance: Choosing How Many Components
After PCA on a dataset with 10 original features:

┌──────────────────────┬─────────────────────┬──────────────────┐
│ Principal Component  │ Variance Explained  │ Cumulative Total │
├──────────────────────┼─────────────────────┼──────────────────┤
│ PC1                  │ 42%                 │ 42%              │
│ PC2                  │ 25%                 │ 67%              │
│ PC3                  │ 15%                 │ 82%              │
│ PC4                  │ 8%                  │ 90%              │
│ PC5                  │ 4%                  │ 94%              │
│ PC6–PC10             │ 6% total            │ 100%             │
└──────────────────────┴─────────────────────┴──────────────────┘

Decision: Keep PC1–PC4 (90% variance retained).

Original: 10 features
After PCA: 4 features
Information kept: 90%
Information lost: only 10%

This is a major reduction with minimal information loss.
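This decision rule is easy to automate. The sketch below (synthetic correlated data, illustrative only) fits PCA with all components, takes the cumulative sum of `explained_variance_ratio_`, and finds the smallest number of components that retains 90% of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Mix 10 random signals so the resulting features are correlated
X = rng.normal(size=(300, 10)) @ rng.normal(size=(10, 10))

X_std = StandardScaler().fit_transform(X)   # standardize first
pca = PCA().fit(X_std)                      # fit all 10 components

cumulative = np.cumsum(pca.explained_variance_ratio_)
n_keep = int(np.searchsorted(cumulative, 0.90)) + 1   # first index reaching 90%
print(f"components needed for 90% variance: {n_keep}")
```

scikit-learn can also do this in one step: `PCA(n_components=0.90)` keeps exactly enough components to reach the given variance fraction.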
Scree Plot (Visual Guide to Choosing Components)
Scree Plot — Variance vs Component Number:
Variance
Explained
40% │●
│ ●
25% │ ●
15% │ ●
8% │ ●
4% │ ● ● ● ● ●
└──────────────────────────►
PC1 PC2 PC3 PC4 PC5 ...
Cut point: Where the line "bends" (the elbow)
→ Here: after PC4
→ Keep PC1 to PC4
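A scree plot like the one above takes a few lines of matplotlib. This sketch hard-codes variance ratios matching the earlier table (assumed values, not computed from data); with a real model you would plot `pca.explained_variance_ratio_` instead:

```python
import matplotlib
matplotlib.use("Agg")            # headless backend so the script runs anywhere
import matplotlib.pyplot as plt

# Illustrative ratios matching the table above
ratios = [0.42, 0.25, 0.15, 0.08, 0.04, 0.02, 0.015, 0.01, 0.01, 0.005]
components = range(1, len(ratios) + 1)

plt.plot(components, ratios, marker="o")
plt.xlabel("Principal component")
plt.ylabel("Variance explained")
plt.title("Scree plot")
plt.savefig("scree_plot.png")    # look for the elbow where the curve flattens
```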
What PCA Components Mean
Original features: Height, Weight, Waist, Hip, Chest (body measurements)

After PCA:
- PC1 might represent: "Overall body size" (all measurements high or low together)
- PC2 might represent: "Body shape ratio" (upper vs lower body proportions)

PCA components do NOT have simple, direct interpretations. They are
mathematical combinations of the original features. This loss of
interpretability is the trade-off for fewer dimensions.
PCA vs Feature Selection
┌──────────────────────────┬────────────────┬────────────────────┐
│ Aspect                   │ Feature Select │ PCA                │
├──────────────────────────┼────────────────┼────────────────────┤
│ Original features kept?  │ Yes (subset)   │ No (new components)│
│ Interpretability         │ High           │ Low                │
│ Handles correlation?     │ Partially      │ Yes (fully)        │
│ Always reduces dims?     │ Yes            │ Yes                │
│ Requires scaling?        │ No             │ Yes (mandatory)    │
│ Works on any data type?  │ Yes            │ Numerical only     │
└──────────────────────────┴────────────────┴────────────────────┘
Other Dimensionality Reduction Methods
┌───────────────┬─────────────────────────────────────────────────┐
│ Method        │ Best For                                        │
├───────────────┼─────────────────────────────────────────────────┤
│ t-SNE         │ Visualization only (2D or 3D plots)             │
│               │ Very good at showing cluster structure          │
│               │ Cannot be used for model features               │
│ LDA (Linear   │ Supervised dimensionality reduction             │
│ Discriminant) │ Maximizes class separation (needs labels)       │
│ Autoencoders  │ Neural network-based compression                │
│               │ Handles complex, non-linear relationships       │
└───────────────┴─────────────────────────────────────────────────┘
Full PCA Workflow
High-Dimensional Dataset
│
▼
Standardize Features (zero mean, unit variance)
│
▼
Compute Covariance Matrix
│
▼
Compute Eigenvectors and Eigenvalues
│
▼
Sort by Explained Variance → Plot Scree Plot
│
▼
Choose number of components (e.g., 95% variance threshold)
│
▼
Project Data → Reduced Dataset
│
▼
Train Model on Reduced Dataset (faster, less overfit)
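The whole workflow above fits naturally into a scikit-learn Pipeline. This is a sketch on the built-in breast cancer dataset (chosen only as a convenient 30-feature example): standardize, keep enough components for 95% variance, then train a classifier.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)   # 30 original features
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(
    StandardScaler(),                 # step 1: zero mean, unit variance
    PCA(n_components=0.95),           # keep enough PCs for 95% variance
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

n_kept = model.named_steps["pca"].n_components_
print(f"features: 30 -> {n_kept}, test accuracy: {model.score(X_test, y_test):.3f}")
```

Because the scaler and PCA live inside the pipeline, they are fit only on the training split, avoiding leakage from the test set.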
