Machine Learning Regularization: L1 and L2
Regularization adds a penalty to the model's cost function to prevent it from growing too complex. It discourages the model from assigning very large values to feature weights (coefficients), which is a hallmark of overfitting. The two most common forms are L1 (Lasso) and L2 (Ridge) regularization.
Why Large Coefficients Cause Overfitting
In Linear Regression, the model learns a formula like:

Price = 1.2×Size + 0.8×Rooms + 150×Noise_Feature + 5000

If Noise_Feature is random data, the model assigns it a large coefficient (150) because it accidentally helps on the training data. On test data, this large coefficient causes wild, wrong predictions.

Regularization penalizes large coefficients → they stay small → less sensitivity to noise → better generalization.
L2 Regularization (Ridge)
Cost Function with L2:
Total Cost = Original Loss + λ × Sum of (all coefficients²)
Effect:
All coefficients get smaller, but none reach exactly zero.
Every feature still contributes, just with reduced influence.
λ (lambda) = regularization strength:
λ = 0 → No regularization (original model)
λ small → Light penalty (slight shrinkage)
λ large → Heavy penalty (all coefficients near zero)
Example:
Without Ridge: coefficients = [150, 0.8, 1.2, 0.5, 200]
With Ridge (λ=1): coefficients ≈ [12, 0.75, 1.1, 0.45, 18]
The noise feature dropped from 150 to 12 — still in the model, but much less influential.
Best for: All features are somewhat relevant.
Want to keep all features but reduce their impact.
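A rough sketch of this shrinkage effect, using synthetic data and a hand-rolled closed-form ridge solver (the feature names and λ value are illustrative, not from the example above):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
size = rng.normal(0.0, 1.0, n)
noise_feature = rng.normal(0.0, 1.0, n)     # pure noise, no real signal
y = 3.0 * size + rng.normal(0.0, 0.1, n)    # target depends only on size
X = np.column_stack([size, noise_feature])

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: w = (X^T X + lam*I)^(-1) X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_plain = ridge_fit(X, y, 0.0)    # ordinary least squares
w_ridge = ridge_fit(X, y, 10.0)   # L2-penalized fit
# Coefficients shrink toward zero, but none becomes exactly zero.
```

The closed form makes the mechanism visible: adding λ to the diagonal of XᵀX biases every coefficient toward zero, with larger λ meaning stronger shrinkage.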
L1 Regularization (Lasso)
Cost Function with L1:
Total Cost = Original Loss + λ × Sum of |all coefficients|
Effect:
Some coefficients are pushed to EXACTLY zero.
Those features are completely removed from the model.
L1 performs automatic feature selection.
Example:
Without Lasso: coefficients = [150, 0.8, 1.2, 0.5, 200]
With Lasso (λ=1): coefficients ≈ [0, 0.6, 1.0, 0, 0]
Features 1, 4, and 5 were eliminated (set to zero).
Only features 2 and 3 remain in the model.
Best for: Many features, only a few are truly relevant.
Want a sparse, interpretable model.
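To see how L1 drives coefficients to exactly zero, here is a minimal coordinate-descent Lasso solver (the standard soft-thresholding update; the data and λ value are illustrative assumptions, not the example above):

```python
import numpy as np

def soft_threshold(z, g):
    """S(z, g) = sign(z) * max(|z| - g, 0): the core of the L1 update."""
    return np.sign(z) * np.maximum(np.abs(z) - g, 0.0)

def lasso_fit(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)*||y - Xw||^2 + lam*||w||_1."""
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(d):
            # residual with feature j's current contribution removed
            r = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r / n
            w[j] = soft_threshold(rho, lam) / col_sq[j]
    return w

rng = np.random.default_rng(0)
n = 200
signal = rng.normal(0.0, 1.0, n)
noise_feature = rng.normal(0.0, 1.0, n)   # irrelevant feature
y = 3.0 * signal + rng.normal(0.0, 0.1, n)
X = np.column_stack([signal, noise_feature])

w = lasso_fit(X, y, lam=0.1)
# The irrelevant feature's coefficient is driven to exactly 0.0.
```

The soft-threshold step is what makes sparsity possible: any coefficient whose correlation with the residual falls below λ is snapped to exactly zero, not merely shrunk.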
L1 vs L2 Comparison
┌────────────────────────┬─────────────────────┬─────────────────────┐
│ Feature                │ L1 (Lasso)          │ L2 (Ridge)          │
├────────────────────────┼─────────────────────┼─────────────────────┤
│ Penalty term           │ |coefficient|       │ coefficient²        │
│ Coefficients → zero?   │ Yes (exactly zero)  │ No (near zero)      │
│ Feature selection?     │ Yes (automatic)     │ No (keeps all)      │
│ Best when              │ Many irrelevant     │ All features matter │
│                        │ features exist      │ roughly equally     │
│ Sparse model?          │ Yes                 │ No                  │
│ Algorithm name         │ Lasso Regression    │ Ridge Regression    │
└────────────────────────┴─────────────────────┴─────────────────────┘
Elastic Net: Combining L1 and L2
Elastic Net uses both penalties together:

Total Cost = Loss + λ1 × |coeff| + λ2 × coeff²

Benefits:
- Selects important features (from L1)
- Handles correlated features well (from L2)
- More stable than pure L1 when features are correlated

Use Elastic Net when:
- Many features with some being irrelevant
- Several correlated features exist in the dataset
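The combined cost function is straightforward to write down directly. A minimal sketch (illustrative function name; the tiny worked example is an assumption, chosen so the arithmetic is easy to check):

```python
import numpy as np

def elastic_net_cost(X, y, w, lam1, lam2):
    """MSE loss + lam1 * sum|w| (L1 part) + lam2 * sum(w^2) (L2 part)."""
    mse = np.mean((X @ w - y) ** 2)
    return mse + lam1 * np.sum(np.abs(w)) + lam2 * np.sum(w ** 2)

# Tiny worked example: a perfect fit, so the whole cost is the penalty.
X = np.array([[1.0], [1.0]])
y = np.array([1.0, 1.0])
w = np.array([1.0])
cost = elastic_net_cost(X, y, w, lam1=0.5, lam2=0.5)   # 0 + 0.5 + 0.5 = 1.0
```

Setting lam1=0 recovers pure Ridge and lam2=0 recovers pure Lasso, which is why Elastic Net is often tuned via a single strength plus a mixing ratio between the two.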
Choosing Lambda
Lambda is a hyperparameter — it must be tuned.

Method: Cross-Validation
Try many values: λ = 0.001, 0.01, 0.1, 1, 10, 100
For each λ, measure cross-validation accuracy
Choose the λ that gives the best validation performance

Too small λ → Almost no regularization → May overfit
Too large λ → Too much shrinkage → May underfit
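The search procedure above can be sketched with a simple k-fold loop (hand-rolled ridge and fold-splitting on synthetic data; in practice a library tuner would be used instead):

```python
import numpy as np

def ridge_fit(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def cv_mse(X, y, lam, k=5):
    """Average validation MSE over k folds for one lambda value."""
    n = len(y)
    folds = np.array_split(np.arange(n), k)
    errors = []
    for val_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), val_idx)
        w = ridge_fit(X[train_idx], y[train_idx], lam)
        errors.append(np.mean((X[val_idx] @ w - y[val_idx]) ** 2))
    return float(np.mean(errors))

rng = np.random.default_rng(0)
n = 200
X = rng.normal(0.0, 1.0, (n, 3))
y = X @ np.array([3.0, -2.0, 0.0]) + rng.normal(0.0, 0.1, n)

lambdas = [0.001, 0.01, 0.1, 1, 10, 100]
scores = {lam: cv_mse(X, y, lam) for lam in lambdas}
best_lam = min(scores, key=scores.get)   # lowest validation error wins
```

Because each λ is scored on held-out folds rather than the training data, the selection penalizes both extremes: tiny λ overfits the folds, huge λ underfits them.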
Regularization Beyond Linear Models
Regularization applies to many model types:

┌─────────────────────────────────┬────────────────────────────────┐
│ Model                           │ Regularization Technique       │
├─────────────────────────────────┼────────────────────────────────┤
│ Linear / Logistic Regression    │ L1 (Lasso), L2 (Ridge),        │
│                                 │ Elastic Net                    │
│ Neural Networks                 │ L2 weight decay, Dropout,      │
│                                 │ Early Stopping                 │
│ Decision Trees / Random Forest  │ max_depth, min_samples_leaf    │
│ SVM                             │ C parameter (inverse of λ)     │
└─────────────────────────────────┴────────────────────────────────┘
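For neural networks, the L2 penalty usually shows up as "weight decay" inside the optimizer update rather than as an explicit cost term. A minimal sketch of one SGD step (plain Python, illustrative function name):

```python
def sgd_step_with_weight_decay(w, grad, lr, lam):
    """One SGD step; the gradient of lam * w**2 contributes 2 * lam * w,
    which pulls the weight toward zero on every update (weight decay)."""
    return w - lr * (grad + 2.0 * lam * w)

# Even with a zero loss gradient, the weight still decays toward zero:
w = sgd_step_with_weight_decay(w=1.0, grad=0.0, lr=0.1, lam=0.5)
```

This is the same mathematics as Ridge, just applied incrementally: each step multiplies the weight by (1 − 2·lr·λ) before applying the loss gradient.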
