Machine Learning Hyperparameter Tuning

Every Machine Learning algorithm has settings that control how it learns. These settings are called hyperparameters. Unlike model parameters (which the algorithm learns from data), hyperparameters are set by the practitioner before training begins. Choosing the right hyperparameters can dramatically improve model performance.

Parameters vs Hyperparameters

┌──────────────────────────┬────────────────────────────────────────┐
│ Parameters               │ Hyperparameters                        │
├──────────────────────────┼────────────────────────────────────────┤
│ Learned from data        │ Set before training begins             │
│ Change during training   │ Stay fixed during training             │
│ Stored in the model      │ Control how training happens           │
│ Example: weights in      │ Example: learning rate, max_depth,     │
│ Linear Regression        │ number of trees, K in KNN              │
└──────────────────────────┴────────────────────────────────────────┘

Analogy:
  Building a house:
  Parameters = bricks, beams (chosen based on the design)
  Hyperparameters = blueprint rules (set before construction)
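
The distinction is easy to see in code. In this toy gradient-descent fit of y = 2x (all names here are illustrative, not from any library), the learning rate is fixed before training, while the weight w is learned from the data:

```python
# Fit y = 2x by gradient descent on squared error.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

learning_rate = 0.05  # hyperparameter: chosen BEFORE training, never changes
w = 0.0               # parameter: starts arbitrary, learned FROM the data

for _ in range(200):                 # number of epochs: another hyperparameter
    for x, y in data:
        grad = 2 * (w * x - y) * x   # d/dw of (w*x - y)^2
        w -= learning_rate * grad    # only the parameter is updated

print(round(w, 3))  # → 2.0, the true slope
```

Changing `learning_rate` changes how training proceeds; `w` is what training produces.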

Common Hyperparameters by Algorithm

┌──────────────────────┬──────────────────────────────────────────┐
│ Algorithm            │ Key Hyperparameters                      │
├──────────────────────┼──────────────────────────────────────────┤
│ Decision Tree        │ max_depth, min_samples_split,            │
│                      │ min_samples_leaf, criterion              │
│ Random Forest        │ n_estimators, max_depth, max_features    │
│ SVM                  │ C, kernel, gamma                         │
│ KNN                  │ n_neighbors, distance metric             │
│ Linear Regression    │ fit_intercept, normalization             │
│ Logistic Regression  │ C (regularization), solver, max_iter     │
│ Neural Network       │ learning_rate, num_layers, num_neurons,  │
│                      │ batch_size, epochs, dropout_rate         │
│ XGBoost              │ n_estimators, max_depth, learning_rate,  │
│                      │ subsample, colsample_bytree              │
└──────────────────────┴──────────────────────────────────────────┘

Method 1: Grid Search

Grid Search tries EVERY possible combination of hyperparameter values
from a predefined list.

Example: Tuning Random Forest

Parameter Grid:
  n_estimators: [50, 100, 200]
  max_depth:    [3, 5, 10, None]
  max_features: ['sqrt', 'log2']

Total combinations = 3 × 4 × 2 = 24

Grid Search trains and evaluates a model for ALL 24 combinations
using cross-validation, then returns the best combination.

┌──────────────┬───────────┬──────────────┬──────────────────┐
│ n_estimators │ max_depth │ max_features │ CV Accuracy      │
├──────────────┼───────────┼──────────────┼──────────────────┤
│ 50           │ 3         │ sqrt         │ 81.2%            │
│ 50           │ 5         │ sqrt         │ 84.5%            │
│ 100          │ 5         │ sqrt         │ 87.1% ← Best     │
│ 200          │ 5         │ sqrt         │ 87.0%            │
│ 100          │ 10        │ log2         │ 85.6%            │
│ ...          │ ...       │ ...          │ ...              │
└──────────────┴───────────┴──────────────┴──────────────────┘

Best: n_estimators=100, max_depth=5, max_features='sqrt' → 87.1%

Downside: With many hyperparameters and many values,
the number of combinations explodes.
10 hyperparameters × 5 values each = 5^10 = ~10 million combinations
Grid Search becomes too slow for complex models.
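
The combinatorial explosion is easy to demonstrate with `itertools.product`, which performs exactly the enumeration Grid Search does (a stdlib sketch of the bookkeeping, not a scikit-learn call):

```python
from itertools import product

grid = {
    "n_estimators": [50, 100, 200],
    "max_depth":    [3, 5, 10, None],
    "max_features": ["sqrt", "log2"],
}

# Grid Search enumerates the Cartesian product of all value lists.
combos = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(combos))  # → 24, i.e. 3 * 4 * 2 models to train and cross-validate

# With 10 hyperparameters of 5 values each the same product has
# 5 ** 10 = 9,765,625 entries: far too many to train one model per entry.
```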

Method 2: Random Search

Instead of trying every combination, Random Search randomly samples
a fixed number of combinations from the hyperparameter space.

Same parameter grid:
  n_estimators: [50, 100, 200, 300, 500]
  max_depth:    [3, 5, 7, 10, 15, None]
  max_features: ['sqrt', 'log2', 0.5, 0.8]

Total possible combinations = 5 × 6 × 4 = 120

Random Search with n_iter=20:
  Tries only 20 randomly chosen combinations instead of all 120.

Why it works:
  In most problems only a few hyperparameters strongly affect
  performance. For a fixed budget, Random Search tries more distinct
  values of each individual hyperparameter than Grid Search does,
  so it is more likely to hit a good value of the ones that matter.

Research shows:
  Bergstra & Bengio (2012) found that Random Search with ~60 trials
  often matches Grid Searches many times larger: with 60 random
  draws there is a greater than 95% chance of sampling a
  configuration in the top 5% of the search space.
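
Sampling a fixed budget from the space above can be sketched with the stdlib `random` module; in a real run the loop would train and cross-validate a model for each setting, so the score function below is only a stand-in:

```python
import random
from itertools import product

random.seed(0)  # for reproducibility

space = {
    "n_estimators": [50, 100, 200, 300, 500],
    "max_depth":    [3, 5, 7, 10, 15, None],
    "max_features": ["sqrt", "log2", 0.5, 0.8],
}

all_combos = [dict(zip(space, v)) for v in product(*space.values())]
print(len(all_combos))  # → 120 possible combinations (5 * 6 * 4)

n_iter = 20
trials = random.sample(all_combos, n_iter)  # 20 distinct random combinations

def cv_score(params):
    # Placeholder: in practice, a cross-validated metric for these params.
    return random.random()

best = max(trials, key=cv_score)
print(best)  # best of the 20 sampled settings, at 1/6 of the full cost
```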

┌────────────────────────────┬──────────────┬───────────────────┐
│ Feature                    │ Grid Search  │ Random Search     │
├────────────────────────────┼──────────────┼───────────────────┤
│ Covers all combinations?   │ Yes          │ No (random sample)│
│ Time required              │ High         │ Much lower        │
│ Best for small grids?      │ Yes          │ Yes               │
│ Best for large search space│ No           │ Yes               │
│ Guaranteed to find best?   │ Yes (in grid)│ No (probabilistic)│
└────────────────────────────┴──────────────┴───────────────────┘

Method 3: Bayesian Optimization

Bayesian Optimization is more sample-efficient than Grid or Random
Search: it uses the results of previous trials to decide which
combination to try next, instead of choosing blindly.

How it works:
  Step 1: Try a few random combinations (exploration)
  Step 2: Build a model of which regions of the search space
          are likely to contain the best results
  Step 3: Try the next combination in the most promising region
  Step 4: Update the model and repeat

Analogy:
  Searching for gold in a mountain range:
  Random Search: dig random holes everywhere
  Bayesian: dig a few holes, notice gold near the south ridge,
            dig more holes near south ridge, refine from there

Result: Finds good hyperparameters with fewer trials than
        Grid or Random Search — especially valuable when
        each trial takes hours (e.g., deep learning models).

Popular libraries: Optuna, Hyperopt, Scikit-Optimize
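
The explore-then-refine idea can be illustrated with a deliberately simplified loop: no surrogate model, just "sample near the best point so far". Real tools such as Optuna model the whole search space; the objective below is a toy stand-in that peaks at lr = 0.01:

```python
import random

random.seed(1)

def objective(lr):
    # Stand-in for a cross-validated score; best near lr = 0.01.
    return -abs(lr - 0.01)

history = []

# Step 1: exploration -- a few random trials across the whole range.
for _ in range(5):
    lr = 10 ** random.uniform(-4, 0)           # log-uniform in [1e-4, 1]
    history.append((objective(lr), lr))

# Steps 2-4 (simplified): repeatedly sample near the best result so far.
for _ in range(15):
    best_lr = max(history)[1]
    lr = best_lr * 10 ** random.uniform(-0.5, 0.5)  # perturb on log scale
    history.append((objective(lr), lr))

best_score, best_lr = max(history)
print(best_lr)  # typically close to the optimum 0.01 after only 20 trials
```

The refinement phase is why fewer trials suffice: effort concentrates where earlier trials scored well, like digging near the south ridge.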

Cross-Validation During Tuning

NEVER evaluate hyperparameters on the test set.
Always use cross-validation on the training set only.

Correct workflow:
  1. Set aside Test Set (do not touch until final evaluation)
  2. Apply Grid/Random/Bayesian search on Training Set
     using K-Fold Cross Validation
  3. Select best hyperparameters based on CV score
  4. Retrain final model on full Training Set with best params
  5. Evaluate ONCE on Test Set → report final performance

Wrong workflow (data leakage):
  1. Tune hyperparameters by evaluating on Test Set
  2. Pick settings that work best on Test Set
  → Test Set is no longer a fair judge of new-data performance
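
The key discipline in the correct workflow is that test indices never participate in tuning. The index bookkeeping can be sketched with the stdlib alone (model training omitted; sizes are illustrative):

```python
import random

random.seed(42)

n_samples = 100
indices = list(range(n_samples))
random.shuffle(indices)

# 1. Set aside the test set: untouched until the final evaluation.
test_idx = indices[:20]
train_idx = indices[20:]

# 2. K-Fold CV happens ONLY inside the training indices.
k = 5
fold_size = len(train_idx) // k
folds = [train_idx[i * fold_size:(i + 1) * fold_size] for i in range(k)]

for val_fold in folds:
    fit_fold = [j for j in train_idx if j not in val_fold]
    # ... train on fit_fold, score on val_fold, for each candidate setting

# Leakage check: no test index ever appears inside the tuning loop.
leaked = set(test_idx) & set(train_idx)
print(len(leaked))  # → 0: the test set stays out of tuning entirely
```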

Learning Rate: A Critical Hyperparameter

The learning rate controls how big each adjustment step is
during Gradient Descent training.

Visual:

  Error
  │
  │●                              Too high learning rate:
  │  ●    ←── overshoots, oscillates, never converges
  │      ●●●●●●●●
  │
  │
  │●                              Good learning rate:
  │  ●                            smooth descent to minimum
  │    ●
  │      ●
  │        ●●●●●●

  Learning Rate Values:
  0.1    → Often too large, oscillates
  0.01   → Common good default
  0.001  → Slower but more stable
  0.0001 → Very slow, good for fine-tuning

Too high: Model never converges (error bounces around)
Too low:  Model learns very slowly, may get stuck
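
The too-high vs good contrast can be reproduced numerically with gradient descent on the one-dimensional error curve f(w) = (w - 3)^2, a toy function chosen for illustration:

```python
def final_error(learning_rate, steps=50):
    """Run gradient descent on f(w) = (w - 3)**2, return the final error."""
    w = 0.0
    for _ in range(steps):
        grad = 2 * (w - 3)          # derivative of (w - 3)**2
        w -= learning_rate * grad
    return (w - 3) ** 2

print(final_error(1.1))    # too high: every step overshoots, error explodes
print(final_error(0.1))    # good: error shrinks smoothly toward 0
print(final_error(0.001))  # too low: barely moved after 50 steps
```

Each update scales the distance to the minimum by (1 - 2 * learning_rate), so rates above 1.0 make that factor exceed 1 in magnitude and the error diverges.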

Hyperparameter Tuning Workflow

Start with Default Hyperparameters
        │
        ▼
Train Model → Evaluate with Cross Validation
        │
        ▼
Is performance acceptable?
        │
        ├── Yes → Proceed to Test Set Evaluation
        │
        └── No → Define Search Space
                    │
                    ▼
               Choose Search Strategy
               (Grid / Random / Bayesian)
                    │
                    ▼
               Run Search with Cross Validation
                    │
                    ▼
               Select Best Hyperparameters
                    │
                    ▼
               Retrain on Full Training Set
                    │
                    ▼
               Evaluate on Test Set → Done ✓

Practical Tips

1. Start with defaults:
   Most libraries (scikit-learn, XGBoost) have well-chosen defaults.
   Default Random Forest often outperforms a poorly tuned Decision Tree.

2. Tune the most impactful hyperparameters first:
   Random Forest: n_estimators and max_depth matter most.
   XGBoost: learning_rate and n_estimators matter most.
   SVM: C and gamma matter most.

3. Use logarithmic scales for learning rate and C:
   Search: [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]
   Not: [0.1, 0.2, 0.3, 0.4, 0.5] (too narrow a range)
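
A logarithmic grid like the one above is just successive powers of 10, so it is easy to generate without any library (equivalent in spirit to numpy's logspace, but stdlib-only here):

```python
# Each step multiplies by 10, so seven values span seven orders of magnitude.
log_grid = [10.0 ** i for i in range(-4, 3)]
print(log_grid)  # 0.0001 up through 100.0

# A linear grid with a similar number of points covers far less ground:
linear_grid = [round(0.1 * (i + 1), 1) for i in range(5)]
print(linear_grid)  # → [0.1, 0.2, 0.3, 0.4, 0.5]
```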

4. More data beats more tuning:
   Collecting 2× more training data often helps more than
   hours of hyperparameter search.
