Machine Learning Hyperparameter Tuning
Every Machine Learning algorithm has settings that control how it learns. These settings are called hyperparameters. Unlike model parameters (which the algorithm learns from data), hyperparameters are set by the practitioner before training begins. Choosing the right hyperparameters can dramatically improve model performance.
Parameters vs Hyperparameters
┌──────────────────────────┬────────────────────────────────────────┐
│ Parameters               │ Hyperparameters                        │
├──────────────────────────┼────────────────────────────────────────┤
│ Learned from data        │ Set before training begins             │
│ Change during training   │ Stay fixed during training             │
│ Stored in the model      │ Control how training happens           │
│ Example: weights in      │ Example: learning rate, max_depth,     │
│ Linear Regression        │ number of trees, K in KNN              │
└──────────────────────────┴────────────────────────────────────────┘

Analogy: Building a house:
  Parameters      = bricks, beams (chosen based on the design)
  Hyperparameters = blueprint rules (set before construction)
Common Hyperparameters by Algorithm
┌──────────────────────┬──────────────────────────────────────────┐
│ Algorithm            │ Key Hyperparameters                      │
├──────────────────────┼──────────────────────────────────────────┤
│ Decision Tree        │ max_depth, min_samples_split,            │
│                      │ min_samples_leaf, criterion              │
│ Random Forest        │ n_estimators, max_depth, max_features    │
│ SVM                  │ C, kernel, gamma                         │
│ KNN                  │ n_neighbors, distance metric             │
│ Linear Regression    │ fit_intercept, normalization             │
│ Logistic Regression  │ C (regularization), solver, max_iter     │
│ Neural Network       │ learning_rate, num_layers, num_neurons,  │
│                      │ batch_size, epochs, dropout_rate         │
│ XGBoost              │ n_estimators, max_depth, learning_rate,  │
│                      │ subsample, colsample_bytree              │
└──────────────────────┴──────────────────────────────────────────┘
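The parameter/hyperparameter distinction shows up directly in code. A minimal sketch, assuming scikit-learn is available; the dataset is synthetic:

```python
# Hyperparameters are set before fit(); parameters are learned by fit().
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# Hyperparameters: chosen by the practitioner up front
clf = RandomForestClassifier(n_estimators=100, max_depth=5,
                             max_features="sqrt", random_state=42)

clf.fit(X, y)  # the model's parameters (the fitted trees) are learned here

print(clf.get_params()["max_depth"])  # hyperparameter: still 5 after training
print(len(clf.estimators_))           # learned structure: 100 fitted trees
```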
Method 1: Grid Search
Grid Search tries EVERY possible combination of hyperparameter
values from a predefined list.

Example: Tuning Random Forest

Parameter Grid:
  n_estimators: [50, 100, 200]
  max_depth:    [3, 5, 10, None]
  max_features: ['sqrt', 'log2']

Total combinations = 3 × 4 × 2 = 24

Grid Search trains and evaluates a model for ALL 24 combinations
using cross-validation, then returns the best combination.

┌──────────────┬───────────┬──────────────┬──────────────────┐
│ n_estimators │ max_depth │ max_features │ CV Accuracy      │
├──────────────┼───────────┼──────────────┼──────────────────┤
│ 50           │ 3         │ sqrt         │ 81.2%            │
│ 50           │ 5         │ sqrt         │ 84.5%            │
│ 100          │ 5         │ sqrt         │ 87.1%  ← Best    │
│ 200          │ 5         │ sqrt         │ 87.0%            │
│ 100          │ 10        │ log2         │ 85.6%            │
│ ...          │ ...       │ ...          │ ...              │
└──────────────┴───────────┴──────────────┴──────────────────┘

Best: n_estimators=100, max_depth=5, max_features='sqrt' → 87.1%

Downside: With many hyperparameters and many values, the number
of combinations explodes:

  10 hyperparameters × 5 values each = 5^10 ≈ 10 million combinations

Grid Search becomes too slow for complex models.
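The grid above maps directly onto scikit-learn's GridSearchCV. A sketch on a synthetic dataset (the CV accuracies in the table are illustrative, so the numbers here will differ; 3-fold CV keeps the sketch fast):

```python
# Exhaustive search over a 3 x 4 x 2 = 24-combination grid.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 10, None],
    "max_features": ["sqrt", "log2"],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,        # 3-fold cross-validation per combination
    n_jobs=-1,
)
search.fit(X, y)  # trains 24 combinations x 3 folds = 72 fits

print(search.best_params_)
print(round(search.best_score_, 3))
```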
Method 2: Random Search
Instead of trying every combination, Random Search randomly samples
a fixed number of combinations from the hyperparameter space.

An expanded parameter grid:
  n_estimators: [50, 100, 200, 300, 500]
  max_depth:    [3, 5, 7, 10, 15, None]
  max_features: ['sqrt', 'log2', 0.5, 0.8]

Total possible combinations = 5 × 6 × 4 = 120

Random Search with n_iter=20:
  Tries only 20 randomly chosen combinations instead of all 120.

Why it works: Most hyperparameter value ranges have diminishing
returns; a few values dominate, and Random Search covers a wider
range of each hyperparameter with fewer trials.

Research shows: Random Search with 60 trials often matches Grid
Search with 1000+ trials in practice.

┌────────────────────────────┬──────────────────┬───────────────────┐
│ Feature                    │ Grid Search      │ Random Search     │
├────────────────────────────┼──────────────────┼───────────────────┤
│ Covers all combinations?   │ Yes              │ No (random sample)│
│ Time required              │ High             │ Much lower        │
│ Best for small grids?      │ Yes              │ Yes               │
│ Best for large spaces?     │ No               │ Yes               │
│ Finds the grid's best?     │ Yes (exhaustive) │ No (probabilistic)│
└────────────────────────────┴──────────────────┴───────────────────┘
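The same idea with scikit-learn's RandomizedSearchCV: the full space has 120 combinations, but only 20 are sampled (synthetic data again; scores are illustrative):

```python
# Random Search: sample 20 of the 120 possible combinations.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=150, n_features=10, random_state=0)

param_distributions = {
    "n_estimators": [50, 100, 200, 300, 500],
    "max_depth": [3, 5, 7, 10, 15, None],
    "max_features": ["sqrt", "log2", 0.5, 0.8],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=20,     # only 20 random combinations, not all 120
    cv=3,
    random_state=0,
    n_jobs=-1,
)
search.fit(X, y)

print(search.best_params_)
```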
Method 3: Bayesian Optimization
Bayesian Optimization is smarter than both Grid and Random Search.
It learns from previous trials to guide the next trial.
How it works:
Step 1: Try a few random combinations (exploration)
Step 2: Build a model of which regions of the search space
are likely to contain the best results
Step 3: Try the next combination in the most promising region
Step 4: Update the model and repeat
Analogy:
Searching for gold in a mountain range:
Random Search: dig random holes everywhere
Bayesian: dig a few holes, notice gold near the south ridge,
dig more holes near south ridge, refine from there
Result: Finds good hyperparameters with fewer trials than
Grid or Random Search — especially valuable when
each trial takes hours (e.g., deep learning models).
Popular libraries: Optuna, Hyperopt, Scikit-Optimize
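The steps above can be caricatured in a few lines of plain Python. This toy loop captures only the explore-then-exploit idea; real libraries such as Optuna fit a probabilistic surrogate model instead of just sampling near the best point. `cv_score` here is a made-up stand-in for a real cross-validation score, peaked at learning rate 0.01:

```python
# Toy explore-then-exploit search over log10(learning rate) in [-4, 0].
import math
import random

random.seed(0)

def cv_score(lr):
    # Hypothetical validation score; best at lr = 0.01 (log10 lr = -2)
    return 1.0 - 0.1 * abs(math.log10(lr) + 2)

# Step 1: explore with a few random trials
history = []
for _ in range(5):
    lr = 10 ** random.uniform(-4, 0)
    history.append((lr, cv_score(lr)))

# Steps 2-4: repeatedly sample near the best point so far
# (the "promising region"), keeping the full trial history
for _ in range(15):
    best_lr, _ = max(history, key=lambda t: t[1])
    candidate = 10 ** (math.log10(best_lr) + random.uniform(-0.5, 0.5))
    history.append((candidate, cv_score(candidate)))

best_lr, best = max(history, key=lambda t: t[1])
print(round(math.log10(best_lr), 2))  # typically lands near -2
```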
Cross-Validation During Tuning
NEVER evaluate hyperparameters on the test set.
Always use cross-validation on the training set only.
Correct workflow:
1. Set aside Test Set (do not touch until final evaluation)
2. Apply Grid/Random/Bayesian search on Training Set
using K-Fold Cross Validation
3. Select best hyperparameters based on CV score
4. Retrain final model on full Training Set with best params
5. Evaluate ONCE on Test Set → report final performance
Wrong workflow (data leakage):
1. Tune hyperparameters by evaluating on Test Set
2. Pick settings that work best on Test Set
→ Test Set is no longer a fair judge of new-data performance
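The correct workflow, sketched with scikit-learn on synthetic data. GridSearchCV's default refit=True already handles step 4 (retraining on the full training set with the best params):

```python
# Correct workflow: the test set is held out and scored exactly once.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# 1. Set aside the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 2-3. Search on the training set only, with 5-fold CV
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [50, 100], "max_depth": [3, 5]},
    cv=5,
)
search.fit(X_train, y_train)

# 4. refit=True (the default) retrained on the full training set
# 5. Evaluate ONCE on the test set
print(round(search.score(X_test, y_test), 3))
```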
Learning Rate: A Critical Hyperparameter
The learning rate controls how big each adjustment step is during
Gradient Descent training.

Visual:

  Error
  │
  │●          Too high learning rate:
  │  ●  ←──   overshoots, oscillates, never converges
  │ ●●●●●●●●
  │
  │
  │●          Good learning rate:
  │ ●         smooth descent to minimum
  │  ●
  │   ●
  │    ●●●●●●

Learning Rate Values:
  0.1    → Often too large, oscillates
  0.01   → Common good default
  0.001  → Slower but more stable
  0.0001 → Very slow, good for fine-tuning

Too high: Model never converges (error bounces around)
Too low:  Model learns very slowly, may get stuck
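The effect is easy to reproduce with plain gradient descent on the toy objective f(w) = w², whose gradient is 2w (not a real model, just the update rule):

```python
# Toy gradient descent on f(w) = w**2, starting from w = 5.0.
def descend(lr, steps=50, w=5.0):
    for _ in range(steps):
        w -= lr * 2 * w  # gradient of w**2 is 2w
    return w

print(abs(descend(0.01)))  # low rate: steady, but still far from 0 after 50 steps
print(abs(descend(0.1)))   # good rate: very close to the minimum at 0
print(abs(descend(1.1)))   # too high: each step overshoots, |w| blows up
```

With lr = 1.1 the multiplier per step is (1 - 2.2) = -1.2, so w flips sign and grows every iteration, which is exactly the oscillating divergence sketched above.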
Hyperparameter Tuning Workflow
Start with Default Hyperparameters
│
▼
Train Model → Evaluate with Cross Validation
│
▼
Is performance acceptable?
│
├── Yes → Proceed to Test Set Evaluation
│
└── No → Define Search Space
│
▼
Choose Search Strategy
(Grid / Random / Bayesian)
│
▼
Run Search with Cross Validation
│
▼
Select Best Hyperparameters
│
▼
Retrain on Full Training Set
│
▼
Evaluate on Test Set → Done ✓
Practical Tips
1. Start with defaults: Most libraries (scikit-learn, XGBoost) have
   well-chosen defaults. A default Random Forest often outperforms
   a poorly tuned Decision Tree.

2. Tune the most impactful hyperparameters first:
   Random Forest: n_estimators and max_depth matter most.
   XGBoost:       learning_rate and n_estimators matter most.
   SVM:           C and gamma matter most.

3. Use logarithmic scales for learning rate and C:
   Search: [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]
   Not:    [0.1, 0.2, 0.3, 0.4, 0.5] (too narrow a range)

4. More data beats more tuning: Collecting 2× more training data
   often helps more than hours of hyperparameter search.
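Tip 3 in code, assuming NumPy is available: np.logspace spaces values evenly in the exponent rather than the value, which is what a learning-rate or C search needs.

```python
# Log-scale grid for learning rate or C: evenly spaced powers of 10.
import numpy as np

grid = np.logspace(-4, 2, num=7)  # seven values: 0.0001, 0.001, ..., 10, 100
print(grid)
```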
