ML Dimensionality Reduction and PCA
Dimensionality reduction is the process of reducing the number of features (columns) in a dataset while keeping as much useful information as possible. As the number of features grows, models train more slowly, overfit more easily, and the data becomes impossible to visualize. Reducing dimensions addresses all of these problems.
The Curse of Dimensionality
Imagine trying to find a lost coin:
- 1D (a line): Searching 1 meter is easy
- 2D (a floor): Searching 1 m² is manageable
- 3D (a room): Searching 1 m³ takes time
- 100D (??): Data becomes impossibly sparse

In high-dimensional space:
- Data points become far apart from each other
- Distance measures lose meaning
- Models need exponentially more data to learn patterns
- Computation becomes very slow

This is called the "Curse of Dimensionality." Dimensionality reduction is the cure.
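The claim that "distance measures lose meaning" can be demonstrated directly. The NumPy sketch below (synthetic uniform data, illustrative only) measures the contrast between the nearest and farthest neighbour of a query point: as dimensions grow, that contrast collapses, so "near" and "far" become almost indistinguishable.

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_contrast(n_dims, n_points=200):
    """Relative gap (max - min) / min over distances from one query point."""
    points = rng.random((n_points, n_dims))   # uniform points in the unit cube
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# Contrast shrinks dramatically as dimensionality grows
for d in [1, 10, 100, 1000]:
    print(f"{d:>4} dims: contrast = {distance_contrast(d):.2f}")
```

In low dimensions the nearest neighbour is many times closer than the farthest; in 1000 dimensions all points sit at nearly the same distance.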
Why Reduce Dimensions?
┌──────────────────────────────┬────────────────────────────────────┐
│ Problem with High Dimensions │ How Reduction Helps                │
├──────────────────────────────┼────────────────────────────────────┤
│ Slow model training          │ Fewer features = faster training   │
│ Overfitting                  │ Fewer features = less noise        │
│ Cannot visualize             │ Reduce to 2D or 3D for plots       │
│ Redundant/correlated features│ Remove duplicated information      │
│ Poor model accuracy          │ Remove noise-only features         │
└──────────────────────────────┴────────────────────────────────────┘
Types of Dimensionality Reduction
Two Main Approaches:

1. Feature Selection: Keep the most useful original features, discard the rest. No new features are created — just a subset is chosen.
   Example: From 50 features, select the top 10 by importance.

2. Feature Extraction (Transformation): Create brand-new features that are combinations of the originals. The new features capture the most information in fewer dimensions.
   Example: PCA — the most popular technique.
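The two approaches can be contrasted in a few lines of scikit-learn. This is a minimal sketch using the Iris dataset (chosen only for illustration): SelectKBest keeps 2 of the 4 original columns, while PCA builds 2 entirely new components from all 4.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)            # 150 samples, 4 original features

# Feature selection: keep the 2 original columns with the highest F-score
X_selected = SelectKBest(f_classif, k=2).fit_transform(X, y)

# Feature extraction: create 2 brand-new components from all 4 columns
X_extracted = PCA(n_components=2).fit_transform(X)

print(X_selected.shape, X_extracted.shape)   # both reduced to 2 columns
```

Both results have the same shape, but the selected columns are still interpretable measurements, while the extracted components are mathematical mixtures.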
Principal Component Analysis (PCA)
PCA is the most widely used dimensionality reduction technique. It transforms the original features into a set of new, uncorrelated features called Principal Components. These components are ranked by how much information (variance) each one captures. The first principal component captures the most variance, the second captures the next most, and so on.
Intuition: Finding the Best Viewing Angle
Imagine a 3D object (like a building):
- From straight above → you see its floor plan (a 2D shadow)
- From the side → you see a different 2D shape

PCA finds the angle (direction) from which the shadow captures the MOST variation in the data.
- That direction = Principal Component 1 (PC1)
- The next best direction (perpendicular to PC1) = PC2
- And so on...
PCA Example: 2D to 1D
Original Data (2 features: Height, Weight):

  Weight
    │
    │              ●●●
    │          ●●●●
    │       ●●●●
    │   ●●●●
    │ ●●
    └────────────────────────────► Height

Observation: Height and Weight are highly correlated. They move together
along a diagonal direction. PCA finds this diagonal direction = PC1.

After PCA (project onto PC1):

  ●─────────────────────────────► PC1

One number now represents most of the information that Height and Weight
both carried together. We went from 2 features to 1, losing very little.
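The 2D-to-1D example above can be reproduced numerically. This sketch generates synthetic correlated height/weight data (the numbers are illustrative assumptions), finds PC1 from the covariance matrix, and projects each sample onto it:

```python
import numpy as np

rng = np.random.default_rng(42)
height = rng.normal(170, 10, 500)                    # cm
weight = 0.9 * height - 80 + rng.normal(0, 3, 500)   # kg, strongly correlated
X = np.column_stack([height, weight])

Xc = X - X.mean(axis=0)                 # center the data
cov = np.cov(Xc, rowvar=False)          # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
pc1 = eigvecs[:, -1]                    # direction of maximum variance

projected = Xc @ pc1                    # one number per sample
retained = eigvals[-1] / eigvals.sum()
print(f"variance retained by PC1: {retained:.1%}")
```

With data this correlated, a single component keeps well over 90% of the total variance, exactly the "2 features to 1, losing very little" claim.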
PCA Step by Step
Step 1: Standardize all features (mean = 0, std = 1).
        PCA is sensitive to scale — always standardize first.

Step 2: Compute the Covariance Matrix.
        Shows how much every pair of features varies together.
        High covariance = features tend to rise and fall together.

Step 3: Compute Eigenvectors and Eigenvalues.
        Eigenvectors = directions of the principal components
        Eigenvalues  = amount of variance each component captures

Step 4: Sort components by Eigenvalue (highest first).
        PC1 = direction of maximum variance
        PC2 = next direction (perpendicular to PC1)
        PC3 = next (perpendicular to PC1 and PC2)
        ...

Step 5: Choose how many components to keep.

Step 6: Project the original data onto the chosen components.
        New dataset = original data × selected eigenvectors
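The six steps above map directly onto a short NumPy implementation. This is a from-scratch sketch on a toy random matrix (with one deliberately redundant column), not a production implementation:

```python
import numpy as np

def pca_from_scratch(X, n_components):
    # Step 1: standardize (mean 0, std 1 per feature)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: covariance matrix (features x features)
    C = np.cov(Z, rowvar=False)
    # Step 3: eigenvectors (directions) and eigenvalues (variances)
    eigvals, eigvecs = np.linalg.eigh(C)
    # Step 4: sort by eigenvalue, highest first
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Step 5: keep the leading components
    W = eigvecs[:, :n_components]
    # Step 6: project — new data = standardized data x selected eigenvectors
    return Z @ W, eigvals

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=100)   # redundant feature

X_reduced, eigvals = pca_from_scratch(X, n_components=2)
print(X_reduced.shape)   # (100, 2)
```

In practice you would use `sklearn.decomposition.PCA`, which also handles numerical edge cases; the steps are the same.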
Explained Variance: Choosing How Many Components
After PCA on a dataset with 10 original features:

┌──────────────────────┬─────────────────────┬──────────────────┐
│ Principal Component  │ Variance Explained  │ Cumulative Total │
├──────────────────────┼─────────────────────┼──────────────────┤
│ PC1                  │ 42%                 │ 42%              │
│ PC2                  │ 25%                 │ 67%              │
│ PC3                  │ 15%                 │ 82%              │
│ PC4                  │ 8%                  │ 90%              │
│ PC5                  │ 4%                  │ 94%              │
│ PC6–PC10             │ 6% total            │ 100%             │
└──────────────────────┴─────────────────────┴──────────────────┘

Decision: Keep PC1–PC4 (90% variance retained).

Original: 10 features
After PCA: 4 features
Information kept: 90%
Information lost: only 10%

This is a major reduction with minimal information loss.
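This decision rule is easy to automate. The sketch below (synthetic correlated data, illustrative only) fits PCA with all components, takes the cumulative sum of `explained_variance_ratio_`, and finds the smallest number of components that retains 90% of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Mix 10 random signals so the resulting features are correlated
X = rng.normal(size=(300, 10)) @ rng.normal(size=(10, 10))

X_std = StandardScaler().fit_transform(X)   # standardize first
pca = PCA().fit(X_std)                      # fit all 10 components

cumulative = np.cumsum(pca.explained_variance_ratio_)
n_keep = int(np.searchsorted(cumulative, 0.90)) + 1   # first index reaching 90%
print(f"components needed for 90% variance: {n_keep}")
```

scikit-learn can also do this in one step: `PCA(n_components=0.90)` keeps exactly enough components to reach the given variance fraction.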
Scree Plot (Visual Guide to Choosing Components)
Scree Plot — Variance vs Component Number:
Variance
Explained
40% │●
│ ●
25% │ ●
15% │ ●
8% │ ●
4% │ ● ● ● ● ●
└──────────────────────────►
PC1 PC2 PC3 PC4 PC5 ...
Cut point: Where the line "bends" (the elbow)
→ Here: after PC4
→ Keep PC1 to PC4
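A scree plot like the one above takes a few lines of matplotlib. This sketch hard-codes variance ratios matching the earlier table (assumed values, not computed from data); with a real model you would plot `pca.explained_variance_ratio_` instead:

```python
import matplotlib
matplotlib.use("Agg")            # headless backend so the script runs anywhere
import matplotlib.pyplot as plt

# Illustrative ratios matching the table above
ratios = [0.42, 0.25, 0.15, 0.08, 0.04, 0.02, 0.015, 0.01, 0.01, 0.005]
components = range(1, len(ratios) + 1)

plt.plot(components, ratios, marker="o")
plt.xlabel("Principal component")
plt.ylabel("Variance explained")
plt.title("Scree plot")
plt.savefig("scree_plot.png")    # look for the elbow where the curve flattens
```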
What PCA Components Mean
Original features: Height, Weight, Waist, Hip, Chest (body measurements)

After PCA:
- PC1 might represent: "Overall body size" (all measurements high or low together)
- PC2 might represent: "Body shape ratio" (upper vs lower body proportions)

PCA components do NOT have simple, direct interpretations. They are
mathematical combinations of the original features. This loss of
interpretability is the trade-off for fewer dimensions.
PCA vs Feature Selection
┌──────────────────────────┬────────────────┬────────────────────┐
│ Aspect                   │ Feature Select │ PCA                │
├──────────────────────────┼────────────────┼────────────────────┤
│ Original features kept?  │ Yes (subset)   │ No (new components)│
│ Interpretability         │ High           │ Low                │
│ Handles correlation?     │ Partially      │ Yes (fully)        │
│ Always reduces dims?     │ Yes            │ Yes                │
│ Requires scaling?        │ No             │ Yes (mandatory)    │
│ Works on any data type?  │ Yes            │ Numerical only     │
└──────────────────────────┴────────────────┴────────────────────┘
Other Dimensionality Reduction Methods
┌───────────────┬─────────────────────────────────────────────────┐
│ Method        │ Best For                                        │
├───────────────┼─────────────────────────────────────────────────┤
│ t-SNE         │ Visualization only (2D or 3D plots)             │
│               │ Very good at showing cluster structure          │
│               │ Cannot be used for model features               │
│ LDA (Linear   │ Supervised dimensionality reduction             │
│ Discriminant) │ Maximizes class separation (needs labels)       │
│ Autoencoders  │ Neural network-based compression                │
│               │ Handles complex, non-linear relationships       │
└───────────────┴─────────────────────────────────────────────────┘
Full PCA Workflow
High-Dimensional Dataset
│
▼
Standardize Features (zero mean, unit variance)
│
▼
Compute Covariance Matrix
│
▼
Compute Eigenvectors and Eigenvalues
│
▼
Sort by Explained Variance → Plot Scree Plot
│
▼
Choose number of components (e.g., 95% variance threshold)
│
▼
Project Data → Reduced Dataset
│
▼
Train Model on Reduced Dataset (faster, less overfit)
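The whole workflow above fits naturally into a scikit-learn Pipeline. This is a sketch on the built-in breast cancer dataset (chosen only as a convenient 30-feature example): standardize, keep enough components for 95% variance, then train a classifier.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)   # 30 original features
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(
    StandardScaler(),                 # step 1: zero mean, unit variance
    PCA(n_components=0.95),           # keep enough PCs for 95% variance
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

n_kept = model.named_steps["pca"].n_components_
print(f"features: 30 -> {n_kept}, test accuracy: {model.score(X_test, y_test):.3f}")
```

Because the scaler and PCA live inside the pipeline, they are fit only on the training split, avoiding leakage from the test set.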
