Machine Learning Regularization: L1 and L2
Regularization adds a penalty to the model's cost function to prevent it from growing too complex. It discourages the model from assigning very large values to feature weights (coefficients), which is a hallmark of overfitting. The two most common forms are L1 (Lasso) and L2 (Ridge) regularization.
Why Large Coefficients Cause Overfitting
In Linear Regression, the model learns a formula like:

Price = 1.2×Size + 0.8×Rooms + 150×Noise_Feature + 5000

If Noise_Feature is random data, the model assigns it a large coefficient (150) because it accidentally helps on the training data. On test data, this large coefficient causes wild, wrong predictions.

Regularization penalizes large coefficients → they stay small → less sensitivity to noise → better generalization.
L2 Regularization (Ridge)
Cost Function with L2:
Total Cost = Original Loss + λ × Sum of (all coefficients²)
Effect:
All coefficients get smaller, but none reach exactly zero.
Every feature still contributes, just with reduced influence.
λ (lambda) = regularization strength:
λ = 0 → No regularization (original model)
λ small → Light penalty (slight shrinkage)
λ large → Heavy penalty (all coefficients near zero)
Example:
Without Ridge: coefficients = [150, 0.8, 1.2, 0.5, 200]
With Ridge (λ=1): coefficients ≈ [12, 0.75, 1.1, 0.45, 18]
The noise feature dropped from 150 to 12 — still in the model, but much less influential.
Best for: All features are somewhat relevant.
Want to keep all features but reduce their impact.
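A rough sketch of this shrinkage effect, using synthetic data and a hand-rolled closed-form ridge solver (the feature names and λ value are illustrative, not from the example above):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
size = rng.normal(0.0, 1.0, n)
noise_feature = rng.normal(0.0, 1.0, n)     # pure noise, no real signal
y = 3.0 * size + rng.normal(0.0, 0.1, n)    # target depends only on size
X = np.column_stack([size, noise_feature])

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: w = (X^T X + lam*I)^(-1) X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_plain = ridge_fit(X, y, 0.0)    # ordinary least squares
w_ridge = ridge_fit(X, y, 10.0)   # L2-penalized fit
# Coefficients shrink toward zero, but none becomes exactly zero.
```

The closed form makes the mechanism visible: adding λ to the diagonal of XᵀX biases every coefficient toward zero, with larger λ meaning stronger shrinkage.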
L1 Regularization (Lasso)
Cost Function with L1:
Total Cost = Original Loss + λ × Sum of |all coefficients|
Effect:
Some coefficients are pushed to EXACTLY zero.
Those features are completely removed from the model.
L1 performs automatic feature selection.
Example:
Without Lasso: coefficients = [150, 0.8, 1.2, 0.5, 200]
With Lasso (λ=1): coefficients ≈ [0, 0.6, 1.0, 0, 0]
Features 1, 4, and 5 were eliminated (set to zero).
Only features 2 and 3 remain in the model.
Best for: Many features, only a few are truly relevant.
Want a sparse, interpretable model.
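To see how L1 drives coefficients to exactly zero, here is a minimal coordinate-descent Lasso solver (the standard soft-thresholding update; the data and λ value are illustrative assumptions, not the example above):

```python
import numpy as np

def soft_threshold(z, g):
    """S(z, g) = sign(z) * max(|z| - g, 0): the core of the L1 update."""
    return np.sign(z) * np.maximum(np.abs(z) - g, 0.0)

def lasso_fit(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)*||y - Xw||^2 + lam*||w||_1."""
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(d):
            # residual with feature j's current contribution removed
            r = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r / n
            w[j] = soft_threshold(rho, lam) / col_sq[j]
    return w

rng = np.random.default_rng(0)
n = 200
signal = rng.normal(0.0, 1.0, n)
noise_feature = rng.normal(0.0, 1.0, n)   # irrelevant feature
y = 3.0 * signal + rng.normal(0.0, 0.1, n)
X = np.column_stack([signal, noise_feature])

w = lasso_fit(X, y, lam=0.1)
# The irrelevant feature's coefficient is driven to exactly 0.0.
```

The soft-threshold step is what makes sparsity possible: any coefficient whose correlation with the residual falls below λ is snapped to exactly zero, not merely shrunk.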
L1 vs L2 Comparison
┌────────────────────────┬─────────────────────┬─────────────────────┐
│ Feature                │ L1 (Lasso)          │ L2 (Ridge)          │
├────────────────────────┼─────────────────────┼─────────────────────┤
│ Penalty term           │ |coefficient|       │ coefficient²        │
│ Coefficients → zero?   │ Yes (exactly zero)  │ No (near zero)      │
│ Feature selection?     │ Yes (automatic)     │ No (keeps all)      │
│ Best when              │ Many irrelevant     │ All features matter │
│                        │ features exist      │ roughly equally     │
│ Sparse model?          │ Yes                 │ No                  │
│ Algorithm name         │ Lasso Regression    │ Ridge Regression    │
└────────────────────────┴─────────────────────┴─────────────────────┘
Elastic Net: Combining L1 and L2
Elastic Net uses both penalties together:

Total Cost = Loss + λ1 × |coeff| + λ2 × coeff²

Benefits:
- Selects important features (from L1)
- Handles correlated features well (from L2)
- More stable than pure L1 when features are correlated

Use Elastic Net when:
- Many features with some being irrelevant
- Several correlated features exist in the dataset
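The combined cost function is straightforward to write down directly. A minimal sketch (illustrative function name; the tiny worked example is an assumption, chosen so the arithmetic is easy to check):

```python
import numpy as np

def elastic_net_cost(X, y, w, lam1, lam2):
    """MSE loss + lam1 * sum|w| (L1 part) + lam2 * sum(w^2) (L2 part)."""
    mse = np.mean((X @ w - y) ** 2)
    return mse + lam1 * np.sum(np.abs(w)) + lam2 * np.sum(w ** 2)

# Tiny worked example: a perfect fit, so the whole cost is the penalty.
X = np.array([[1.0], [1.0]])
y = np.array([1.0, 1.0])
w = np.array([1.0])
cost = elastic_net_cost(X, y, w, lam1=0.5, lam2=0.5)   # 0 + 0.5 + 0.5 = 1.0
```

Setting lam1=0 recovers pure Ridge and lam2=0 recovers pure Lasso, which is why Elastic Net is often tuned via a single strength plus a mixing ratio between the two.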
Choosing Lambda
Lambda is a hyperparameter — it must be tuned.

Method: Cross-Validation
Try many values: λ = 0.001, 0.01, 0.1, 1, 10, 100
For each λ, measure cross-validation accuracy
Choose the λ that gives the best validation performance

Too small λ → Almost no regularization → May overfit
Too large λ → Too much shrinkage → May underfit
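The search procedure above can be sketched with a simple k-fold loop (hand-rolled ridge and fold-splitting on synthetic data; in practice a library tuner would be used instead):

```python
import numpy as np

def ridge_fit(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def cv_mse(X, y, lam, k=5):
    """Average validation MSE over k folds for one lambda value."""
    n = len(y)
    folds = np.array_split(np.arange(n), k)
    errors = []
    for val_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), val_idx)
        w = ridge_fit(X[train_idx], y[train_idx], lam)
        errors.append(np.mean((X[val_idx] @ w - y[val_idx]) ** 2))
    return float(np.mean(errors))

rng = np.random.default_rng(0)
n = 200
X = rng.normal(0.0, 1.0, (n, 3))
y = X @ np.array([3.0, -2.0, 0.0]) + rng.normal(0.0, 0.1, n)

lambdas = [0.001, 0.01, 0.1, 1, 10, 100]
scores = {lam: cv_mse(X, y, lam) for lam in lambdas}
best_lam = min(scores, key=scores.get)   # lowest validation error wins
```

Because each λ is scored on held-out folds rather than the training data, the selection penalizes both extremes: tiny λ overfits the folds, huge λ underfits them.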
Regularization Beyond Linear Models
Regularization applies to many model types:

┌─────────────────────────────────┬────────────────────────────────┐
│ Model                           │ Regularization Technique       │
├─────────────────────────────────┼────────────────────────────────┤
│ Linear / Logistic Regression    │ L1 (Lasso), L2 (Ridge),        │
│                                 │ Elastic Net                    │
│ Neural Networks                 │ L2 weight decay, Dropout,      │
│                                 │ Early Stopping                 │
│ Decision Trees / Random Forest  │ max_depth, min_samples_leaf    │
│ SVM                             │ C parameter (inverse of λ)     │
└─────────────────────────────────┴────────────────────────────────┘
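For neural networks, the L2 penalty usually shows up as "weight decay" inside the optimizer update rather than as an explicit cost term. A minimal sketch of one SGD step (plain Python, illustrative function name):

```python
def sgd_step_with_weight_decay(w, grad, lr, lam):
    """One SGD step; the gradient of lam * w**2 contributes 2 * lam * w,
    which pulls the weight toward zero on every update (weight decay)."""
    return w - lr * (grad + 2.0 * lam * w)

# Even with a zero loss gradient, the weight still decays toward zero:
w = sgd_step_with_weight_decay(w=1.0, grad=0.0, lr=0.1, lam=0.5)
```

This is the same mathematics as Ridge, just applied incrementally: each step multiplies the weight by (1 − 2·lr·λ) before applying the loss gradient.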
