DS Dimensionality Reduction with PCA
Real datasets often contain dozens or hundreds of features. Many of these features are redundant or correlated — they carry overlapping information. Dimensionality Reduction compresses the dataset into fewer, more informative features while retaining most of the original information. Principal Component Analysis (PCA) is the most widely used dimensionality reduction technique in data science.
What Is the Curse of Dimensionality
As the number of features in a dataset grows, the data becomes sparser and harder for algorithms to learn from. This phenomenon is called the Curse of Dimensionality. A simple example: with 1 feature and 10 data points, the space is reasonably covered. With 10 features, the same 10 points cover only a tiny fraction of the possible space — most machine learning algorithms perform worse as a result.
Diagram – Curse of Dimensionality
1D (1 feature):
[● ● ● ● ● ●]              Reasonably covered with a few points

2D (2 features):
+─────────────────────+
│ ●        ●          │
│       ●             │
│             ●       │
│   ●           ●     │
+─────────────────────+    Starting to spread out

3D and beyond:
Points become extremely sparse. Distance between points increases.
Machine learning struggles.

PCA compresses many dimensions → fewer dimensions while keeping most information.
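A quick numerical sketch of this sparsity effect (illustrative only — the points here are random values in a unit hypercube, not a real dataset): the average distance between the same number of points grows as features are added, so every point ends up far from its neighbours.

import numpy as np

# Illustrative sketch: average pairwise distance between 100 random points
# in a unit hypercube, as the number of features (dimensions) grows.
rng = np.random.default_rng(42)
for n_features in [1, 2, 10, 100]:
    points = rng.random((100, n_features))
    diffs = points[:, None, :] - points[None, :, :]        # all pairwise differences
    dists = np.sqrt((diffs ** 2).sum(axis=-1))             # Euclidean distances
    mean_dist = dists[np.triu_indices(100, k=1)].mean()    # ignore the zero diagonal
    print(f"{n_features:3d} features: mean pairwise distance ≈ {mean_dist:.2f}")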
Why Use PCA
- Speed – Fewer features means faster model training and prediction
- Visualisation – Compresses high-dimensional data to 2D or 3D for plotting
- Noise removal – Drops low-variance components that contain mostly noise
- Reduced overfitting – Having fewer features reduces the chance of a model memorising noise
- Removes multicollinearity – PCA components are guaranteed to be uncorrelated with each other (see the quick check after this list)
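The last point is easy to verify. A quick check on made-up data (the values below are purely illustrative) shows that the correlation matrix of the PCA-transformed features is essentially the identity matrix:

import numpy as np
from sklearn.decomposition import PCA

# Quick check on synthetic data: PCA output columns are mutually uncorrelated.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 4))
X_demo[:, 1] += X_demo[:, 0]                    # make two features correlated
components = PCA().fit_transform(X_demo)
print(np.round(np.corrcoef(components, rowvar=False), 3))   # ≈ identity matrix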
How PCA Works – Intuition
PCA finds new axes (called Principal Components) in the direction of maximum variance in the data. The first principal component points in the direction that contains the most variance. The second principal component is perpendicular to the first and captures the next most variance. Each subsequent component captures progressively less variance.
Diagram – PCA Concept
Original data in 2D:
      y
      │          ● ●
      │       ● ●
      │    ● ●
      │ ● ●
      └──────────── x

PCA finds the directions of maximum spread:

      y
      │        ╱   PC1 (most variance — diagonal direction)
      │     ╱
      │  ╱────────── PC2 (perpendicular, less variance)
      └──────────── x
Project data onto PC1 and PC2:
New coordinates: (PC1 value, PC2 value) for each point
If PC1 alone captures 95% of variance:
→ Drop PC2 and work in 1D instead of 2D
→ Minimal information loss
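A tiny numerical version of the picture above, using made-up 2D data where y is roughly equal to x (illustrative values only): PC1 should soak up nearly all of the variance, so dropping PC2 costs very little.

import numpy as np
from sklearn.decomposition import PCA

# Illustrative sketch: strongly correlated 2D data collapses almost entirely onto PC1.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
data_2d = np.column_stack([x, x + rng.normal(scale=0.2, size=200)])  # y ≈ x + small noise

pca = PCA(n_components=2).fit(data_2d)
print(pca.explained_variance_ratio_)   # first value close to 1.0 — PC1 dominates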
PCA Step by Step
PCA Algorithm:

Step 1: Standardise the data (mean = 0, std = 1)
Step 2: Compute the covariance matrix (measures feature correlations)
Step 3: Find eigenvectors and eigenvalues (directions and magnitudes of maximum variance)
Step 4: Sort by eigenvalue (PC1 has the largest eigenvalue, PC2 the second largest, etc.)
Step 5: Project the data onto the top N principal components (choose N = desired dimensions)
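The same five steps can be written out directly in NumPy. The sketch below is illustrative (the function name and arguments are our own, not a library API), but it mirrors the algorithm box above step for step:

import numpy as np

def pca_from_scratch(X, n_keep):
    # Step 1: standardise the data (mean 0, std 1 per feature)
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: covariance matrix of the standardised features
    cov = np.cov(X_std, rowvar=False)
    # Step 3: eigenvectors (directions) and eigenvalues (variance magnitudes)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Step 4: sort components from largest to smallest eigenvalue
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    # Step 5: project the data onto the top n_keep principal components
    return X_std @ eigenvectors[:, :n_keep]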
PCA in Python with Scikit-learn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer
# Load a high-dimensional dataset (30 features)
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
print("Original shape:", X.shape) # (569, 30)
# Step 1: Standardise
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Step 2: Apply PCA – keep ALL components first to inspect variance
pca_full = PCA()
pca_full.fit(X_scaled)
# Explained variance ratio for each component
explained = pca_full.explained_variance_ratio_
cumulative = np.cumsum(explained)
print("\nVariance explained by each component:")
for i, (ev, cv) in enumerate(zip(explained[:10], cumulative[:10])):
    print(f" PC{i+1:2d}: {ev*100:5.2f}% (Cumulative: {cv*100:.2f}%)")
Output:
Variance explained by each component:
 PC 1: 44.27% (Cumulative: 44.27%)
 PC 2: 18.97% (Cumulative: 63.24%)
 PC 3:  9.39% (Cumulative: 72.63%)
 PC 4:  6.60% (Cumulative: 79.23%)
 PC 5:  5.50% (Cumulative: 84.73%)
 PC 6:  4.02% (Cumulative: 88.75%)
 PC 7:  2.25% (Cumulative: 91.00%)
 ...
 PC10:  0.92% (Cumulative: 95.49%)
Scree Plot – How Many Components to Keep
plt.figure(figsize=(10, 4))
# Subplot 1: Explained variance per component
plt.subplot(1, 2, 1)
plt.bar(range(1, 11), explained[:10] * 100, color="steelblue", edgecolor="black")
plt.title("Variance by Component (Scree Plot)")
plt.xlabel("Principal Component")
plt.ylabel("Variance Explained (%)")
plt.xticks(range(1, 11))
# Subplot 2: Cumulative variance
plt.subplot(1, 2, 2)
plt.plot(range(1, 16), cumulative[:15] * 100, "bo-", markersize=7)
plt.axhline(y=95, color="red", linestyle="--", label="95% threshold")
plt.title("Cumulative Variance Explained")
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Variance (%)")
plt.legend()
plt.tight_layout()
plt.savefig("pca_scree.png")
plt.show()
Diagram – Reading the Scree Plot
Cumulative Variance:
100% │  ─ ─ ─ ─ ─ ─ ─ ─ ─ ────────────────
 95% │  ─ ─ ─ ─ ─ ─ ─ ───   ← Red threshold
     │            ╱
 80% │         ╱╯
     │      ╱╯
 60% │   ╱╯
     │ ╱
     └──────────────────────────
        1  2  3  4  5  6  ...  30
              Components
At PC=10, cumulative variance ≈ 95%.
→ Reduce 30 features to 10 with minimal information loss.
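The same "how many components for 95%?" question can be answered without reading the plot, straight from the cumulative array computed earlier:

# Number of components needed to reach 95% cumulative variance
n_95 = int(np.argmax(cumulative >= 0.95)) + 1
print(f"Components needed for 95% variance: {n_95}")   # 10 for this dataset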
Apply PCA with Selected Components
# Reduce to 10 components (captures 95% of variance)
pca_10 = PCA(n_components=10)
X_pca = pca_10.fit_transform(X_scaled)
print("Reduced shape:", X_pca.shape) # (569, 10)
print(f"Total variance retained: {pca_10.explained_variance_ratio_.sum()*100:.2f}%")
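Scikit-learn can also choose the number of components for you: passing a float between 0 and 1 as n_components keeps just enough components to reach that fraction of explained variance.

# Let PCA pick the number of components that reaches 95% variance
pca_auto = PCA(n_components=0.95)
X_auto = pca_auto.fit_transform(X_scaled)
print("Components chosen automatically:", pca_auto.n_components_)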
Visualise Data in 2D with PCA
PCA's most popular use is reducing high-dimensional data to 2 dimensions for visualisation. The 2D scatter plot reveals whether natural groups exist in the data.
# Reduce to 2 components for visualisation
pca_2d = PCA(n_components=2)
X_2d = pca_2d.fit_transform(X_scaled)
# Plot coloured by true label
plt.figure(figsize=(8, 6))
colors = ["tomato", "steelblue"]
labels = ["Malignant", "Benign"]
for cls, color, label in zip([0, 1], colors, labels):
    mask = y == cls
    plt.scatter(X_2d[mask, 0], X_2d[mask, 1],
                c=color, label=label, alpha=0.6, edgecolors="black", s=50)
plt.title("PCA – Breast Cancer Data in 2D")
plt.xlabel(f"PC1 ({pca_2d.explained_variance_ratio_[0]*100:.1f}% variance)")
plt.ylabel(f"PC2 ({pca_2d.explained_variance_ratio_[1]*100:.1f}% variance)")
plt.legend()
plt.tight_layout()
plt.savefig("pca_2d_scatter.png")
plt.show()
PCA as a Preprocessing Step for Machine Learning
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
# Model WITHOUT PCA (all 30 features)
lr_full = LogisticRegression(max_iter=1000, random_state=42)
lr_full.fit(X_train, y_train)
acc_full = accuracy_score(y_test, lr_full.predict(X_test))
# Model WITH PCA (10 components – 95% variance)
pipeline = Pipeline([
("pca", PCA(n_components=10)),
("clf", LogisticRegression(max_iter=1000, random_state=42))
])
pipeline.fit(X_train, y_train)
acc_pca = accuracy_score(y_test, pipeline.predict(X_test))
print(f"Accuracy WITHOUT PCA (30 features): {acc_full:.4f}")
print(f"Accuracy WITH PCA (10 components): {acc_pca:.4f}")
print(f"Feature reduction: 30 → 10 features ({(1-10/30)*100:.0f}% fewer)")
Output:
Accuracy WITHOUT PCA (30 features): 0.9737
Accuracy WITH PCA (10 components): 0.9649
Feature reduction: 30 → 10 features (67% fewer)

→ Only a 0.88 percentage-point drop in accuracy for a 67% reduction in features.
→ The model trains much faster with far fewer features.
Understanding Principal Components
# Which original features contribute most to PC1?
pc1_loadings = pd.Series(
pca_2d.components_[0],
index=data.feature_names
).abs().sort_values(ascending=False)
print("Features most responsible for PC1:")
print(pc1_loadings.head(8))
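The flip side of loadings is reconstruction: inverse_transform maps the 10-component representation back to the 30 standardised features, which gives a rough sense of how much information the compression discards (a quick sketch reusing pca_10 and X_pca from above):

# Reconstruct the 30 standardised features from the 10 principal components
X_reconstructed = pca_10.inverse_transform(X_pca)
reconstruction_error = np.mean((X_scaled - X_reconstructed) ** 2)
print(f"Mean squared reconstruction error: {reconstruction_error:.4f}")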
PCA Limitations and When Not to Use It
| Limitation | Reason |
|---|---|
| PCA components are harder to interpret | Each component is a combination of all original features — not a single recognisable variable |
| PCA captures linear relationships only | For non-linear relationships, use t-SNE or UMAP instead |
| PCA requires feature scaling | Always standardise before applying PCA — it is sensitive to scale differences (demonstrated after this table) |
| Information loss is unavoidable | Keeping fewer components always discards some variance from the original data |
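The scaling requirement is easy to demonstrate with the breast-cancer data loaded earlier: fit PCA on the raw (unscaled) features and the few features with the largest raw variances dominate PC1 almost entirely.

# Quick demonstration of the scaling limitation
pca_raw = PCA(n_components=2).fit(X)          # unscaled features
pca_std = PCA(n_components=2).fit(X_scaled)   # standardised features
print(f"PC1 variance share, unscaled:     {pca_raw.explained_variance_ratio_[0]:.2%}")
print(f"PC1 variance share, standardised: {pca_std.explained_variance_ratio_[0]:.2%}")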
t-SNE – Non-Linear Dimensionality Reduction for Visualisation
t-SNE (t-Distributed Stochastic Neighbour Embedding) reduces data to 2D or 3D for visualisation while preserving local structure. It handles non-linear relationships better than PCA — but is only suitable for visualisation, not as a preprocessing step for models.
from sklearn.manifold import TSNE
# Apply t-SNE to 2D
tsne = TSNE(n_components=2, perplexity=30, random_state=42)  # runs 1000 iterations by default
X_tsne = tsne.fit_transform(X_scaled)
# Visualise
plt.figure(figsize=(8, 6))
for cls, color, label in zip([0, 1], ["tomato", "steelblue"], ["Malignant", "Benign"]):
    mask = y == cls
    plt.scatter(X_tsne[mask, 0], X_tsne[mask, 1],
                c=color, label=label, alpha=0.7, edgecolors="black", s=50)
plt.title("t-SNE – Breast Cancer Data in 2D")
plt.xlabel("t-SNE Component 1")
plt.ylabel("t-SNE Component 2")
plt.legend()
plt.tight_layout()
plt.savefig("tsne_plot.png")
plt.show()
PCA vs t-SNE Comparison
| Aspect | PCA | t-SNE |
|---|---|---|
| Type | Linear transformation | Non-linear transformation |
| Speed | Fast | Slow (not suitable for large datasets) |
| Use for ML preprocessing? | Yes | No — visualisation only |
| Interpretable components? | Partially (loadings available) | No |
| Preserves | Global variance | Local neighbourhood structure |
Summary
- High-dimensional data suffers from the Curse of Dimensionality — models perform worse as features grow
- PCA compresses many features into fewer Principal Components while retaining most of the variance
- The Scree Plot shows how many components are needed to explain 90% or 95% of variance
- Always standardise features before applying PCA — scale differences distort the components
- PCA works as a preprocessing step to speed up training and reduce overfitting
- PCA components are linear combinations of all original features — they lose individual interpretability
- t-SNE captures non-linear structure and often separates clusters more clearly than PCA in 2D plots, but is not suitable for model preprocessing
