DS Model Evaluation and Hyperparameter Tuning
Building a machine learning model is only half the job. Evaluating it correctly and tuning it for peak performance are equally important. This topic covers cross-validation, evaluation metrics in depth, bias-variance tradeoff, hyperparameter tuning with Grid Search and Randomised Search, and learning curves — the complete toolkit for producing reliable, production-ready models.
Why Proper Evaluation Matters
A model evaluated on the same data it trained on will always appear more accurate than it truly is. Proper evaluation uses held-out data, multiple folds, and correct metrics to give an honest picture of how the model performs on real-world, unseen data.
The Evaluation Problem
Naive Approach (Wrong):
  Train model on ALL data → Evaluate on SAME data → Reports 98% accuracy
  ↓
  Real-world accuracy: 72% ← Model memorised training data

Correct Approach:
  Training Data (80%) → Train model
  Test Data (20%)     → Evaluate model (never used during training)
  ↓
  Real-world accuracy: matches test accuracy ← Honest estimate
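The gap is easy to demonstrate. Below is a minimal sketch of the correct approach using scikit-learn's built-in breast cancer dataset (the same dataset used later in this topic) and an unconstrained decision tree, which makes the train/test gap obvious; exact scores will vary with the split:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)

# Hold out 20% of the data that the model never sees during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)

# Training accuracy is optimistic; test accuracy is the honest estimate
print("Train accuracy:", accuracy_score(y_train, tree.predict(X_train)))
print("Test accuracy :", accuracy_score(y_test, tree.predict(X_test)))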
Cross-Validation – More Reliable Evaluation
A single train-test split depends heavily on which 20% of data lands in the test set. Cross-validation removes this randomness by training and evaluating the model on multiple different splits of the data and averaging the results.
K-Fold Cross-Validation
Dataset split into K=5 folds:
┌──────┬──────┬──────┬──────┬──────┐
│ F1 │ F2 │ F3 │ F4 │ F5 │
└──────┴──────┴──────┴──────┴──────┘
Round 1: [Train: F2,F3,F4,F5] [Test: F1] → Score 1
Round 2: [Train: F1,F3,F4,F5] [Test: F2] → Score 2
Round 3: [Train: F1,F2,F4,F5] [Test: F3] → Score 3
Round 4: [Train: F1,F2,F3,F5] [Test: F4] → Score 4
Round 5: [Train: F1,F2,F3,F4] [Test: F5] → Score 5
Final CV Score = Mean of Score 1 to Score 5
+ Standard Deviation (measures stability)
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# 5-Fold Cross-Validation
model = RandomForestClassifier(n_estimators=100, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_scaled, y, cv=cv, scoring="accuracy")
print("CV Scores for each fold:", scores.round(4))
print(f"Mean Accuracy : {scores.mean():.4f}")
print(f"Std Deviation : {scores.std():.4f}")
print(f"95% CI : {scores.mean():.4f} ± {2*scores.std():.4f}")
Output:
CV Scores for each fold: [0.9649 0.9561 0.9737 0.9649 0.9649]
Mean Accuracy : 0.9649
Std Deviation : 0.0057
95% CI        : 0.9649 ± 0.0114
→ Model consistently accurate (low std = stable model)
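For intuition, the same averaging can be written as an explicit loop over the folds. The sketch below shows, in simplified form, what cross_val_score does internally (the real implementation also handles scorers, parallelism and error handling):
from sklearn.base import clone

fold_scores = []
for fold_num, (train_idx, test_idx) in enumerate(cv.split(X_scaled, y), start=1):
    # Fit a fresh copy of the model on the training folds
    fold_model = clone(model)
    fold_model.fit(X_scaled[train_idx], y[train_idx])
    # Score it on the single held-out fold
    score = fold_model.score(X_scaled[test_idx], y[test_idx])
    fold_scores.append(score)
    print(f"Fold {fold_num}: accuracy = {score:.4f}")

print(f"Mean Accuracy: {np.mean(fold_scores):.4f}")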
Stratified K-Fold – For Imbalanced Classes
Stratified K-Fold ensures each fold has the same proportion of each class as the full dataset. This matters when one class appears much more often than another — for example, 95% healthy vs 5% diseased.
# Stratified: preserves class distribution in each fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Check class distribution in one fold
for fold_num, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    y_fold = y[test_idx]
    pos_rate = y_fold.mean()
    print(f"Fold {fold_num+1}: Positive class = {pos_rate:.3f} ({pos_rate*100:.1f}%)")
The Bias-Variance Tradeoff
Every machine learning model makes two types of errors: bias and variance. The tradeoff between them is the central challenge in model selection and tuning.
Definitions
Bias = Error from wrong assumptions in the model.
A high-bias model is too simple — it consistently misses the true pattern.
(Example: fitting a straight line to curved data)
Variance = Error from sensitivity to small fluctuations in training data.
A high-variance model is too complex — it memorises noise.
(Example: a deep decision tree that memorises every training point)
Diagram – Bias-Variance Tradeoff
Total Error
│ \
│ \ ← Total Error curve
│ \ ╱
│ \ ╱
│ \ ╱
│ \ ╱
│ \/ ← Sweet spot (optimal complexity)
│ ╱ \
│ ╱ \
│─────────────────────────→ Model Complexity
↑ ↑
High Bias High Variance
(Underfitting) (Overfitting)
Bias decreases as complexity grows.
Variance increases as complexity grows.
Best model sits at the balance point.
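The tradeoff is easy to see in numbers by comparing training and cross-validation accuracy for two decision trees: a depth-1 stump as an illustrative high-bias model and an unrestricted tree as an illustrative high-variance one. Exact values depend on the dataset; the pattern is what matters:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_validate

models = [
    ("Depth-1 stump (high bias)", DecisionTreeClassifier(max_depth=1, random_state=42)),
    ("Unrestricted tree (high variance)", DecisionTreeClassifier(max_depth=None, random_state=42)),
]

for name, tree in models:
    res = cross_validate(tree, X_scaled, y, cv=5, scoring="accuracy", return_train_score=True)
    # High bias     -> train and CV scores both low and close together
    # High variance -> train score near 1.0, CV score noticeably lower
    print(f"{name}:")
    print(f"  Train accuracy: {res['train_score'].mean():.4f}")
    print(f"  CV accuracy   : {res['test_score'].mean():.4f}")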
Learning Curves – Diagnosing Model Problems
A learning curve plots training and validation accuracy against the number of training samples. The shape of the curve reveals whether the model suffers from high bias, high variance, or neither.
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
def plot_learning_curve(model, X, y, title):
    train_sizes, train_scores, val_scores = learning_curve(
        model, X, y,
        cv=5,
        train_sizes=np.linspace(0.1, 1.0, 10),
        scoring="accuracy",
        random_state=42
    )
    train_mean = train_scores.mean(axis=1)
    val_mean = val_scores.mean(axis=1)
    plt.figure(figsize=(8, 5))
    plt.plot(train_sizes, train_mean, "o-", color="steelblue", label="Training Accuracy")
    plt.plot(train_sizes, val_mean, "o-", color="tomato", label="Validation Accuracy")
    plt.fill_between(train_sizes,
                     train_scores.mean(axis=1) - train_scores.std(axis=1),
                     train_scores.mean(axis=1) + train_scores.std(axis=1),
                     alpha=0.1, color="steelblue")
    plt.fill_between(train_sizes,
                     val_scores.mean(axis=1) - val_scores.std(axis=1),
                     val_scores.mean(axis=1) + val_scores.std(axis=1),
                     alpha=0.1, color="tomato")
    plt.title(f"Learning Curve – {title}")
    plt.xlabel("Training Set Size")
    plt.ylabel("Accuracy")
    plt.legend()
    plt.tight_layout()
    plt.savefig(f"learning_curve_{title.replace(' ','_')}.png")
    plt.show()
plot_learning_curve(model, X_scaled, y, "Random Forest")
Diagram – Interpreting Learning Curves
High Bias (Underfitting):
  Accuracy
  │  Train ───────────
  │  Val   ───────────   (Train ≈ Val)
  └──────────────→ Training Set Size
  Both lines low and close. Model too simple.

High Variance (Overfitting):
  Accuracy
  │  Train ───────────
  │
  │  Val   ─ ─ ─ ─ ─ ─   (well below Train)
  └──────────────→ Training Set Size
  Large gap between Train and Val. Model memorised training data.

Well-Fitted Model:
  Accuracy
  │  Train ─────────────────────   (high)
  │  Val   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   (close to Train)
  └──────────────→ Training Set Size
  Both lines high, gap is small.
Hyperparameter Tuning
Hyperparameters are settings configured before training begins — they control how the algorithm learns. Tuning these settings to find the optimal combination for a specific dataset is called hyperparameter tuning.
Common Hyperparameters by Algorithm
| Algorithm | Key Hyperparameters | Effect |
|---|---|---|
| Random Forest | n_estimators, max_depth, min_samples_split | Controls tree count, depth, and minimum split size |
| SVM | C, kernel, gamma | Controls margin width, shape of boundary, kernel influence |
| KNN | n_neighbors, metric | Controls number of neighbours and distance formula |
| Logistic Regression | C, penalty, solver | Controls regularisation strength and type |
| Gradient Boosting | n_estimators, learning_rate, max_depth | Controls boosting rounds, step size, and tree complexity |
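Hyperparameters like those in the table are passed to the estimator's constructor before training; parameters such as tree splits and feature importances are learned from the data and only exist after fit() is called. A minimal sketch of the distinction:
# Hyperparameters: set by us in the constructor, before training starts
rf = RandomForestClassifier(
    n_estimators=200,       # number of trees in the forest
    max_depth=10,           # maximum depth of each tree
    min_samples_split=5,    # minimum samples needed to split a node
    random_state=42
)
print("Hyperparameters:", rf.get_params()["n_estimators"], "trees, max depth", rf.get_params()["max_depth"])

# Parameters: learned from the data during fit()
rf.fit(X_scaled, y)
print("Learned feature importances (first 3):", rf.feature_importances_[:3].round(3))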
Grid Search – Exhaustive Tuning
Grid Search tries every possible combination of specified hyperparameter values. It trains and evaluates the model for each combination using cross-validation and reports the best-performing combination.
from sklearn.model_selection import GridSearchCV
# Define the hyperparameter grid
param_grid = {
"n_estimators": [50, 100, 200],
"max_depth": [3, 5, 10, None],
"min_samples_split": [2, 5, 10]
}
# Grid search with 5-fold CV
grid_search = GridSearchCV(
estimator=RandomForestClassifier(random_state=42),
param_grid=param_grid,
cv=5,
scoring="accuracy",
n_jobs=-1, # Use all CPU cores
verbose=1
)
grid_search.fit(X_scaled, y)
print("Best Parameters:", grid_search.best_params_)
print("Best CV Accuracy:", grid_search.best_score_.round(4))
Output:
Best Parameters: {'max_depth': 10, 'min_samples_split': 2, 'n_estimators': 200}
Best CV Accuracy: 0.9736
Diagram – Grid Search Combinations
n_estimators: [50, 100, 200]
max_depth: [3, 5, 10, None]
min_samples_split: [2, 5, 10]
Total combinations: 3 × 4 × 3 = 36 combinations
Each tested with 5-fold CV → 36 × 5 = 180 model fits
Grid visualised (n_estimators vs max_depth, fixing min_split=2):
max_depth
3 5 10 None
n_est 50 [0.94 0.96 0.97 0.95]
100 [0.95 0.97 0.97 0.96]
200 [0.95 0.97 0.97 0.96] ← best cell here
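The combination count can be verified with ParameterGrid, which enumerates exactly the combinations GridSearchCV will test. A small sketch using the param_grid defined above:
from sklearn.model_selection import ParameterGrid

combos = list(ParameterGrid(param_grid))
print("Total combinations:", len(combos))        # 3 × 4 × 3 = 36
print("Total model fits  :", len(combos) * 5)    # 36 combinations × 5 folds = 180
print("First combination :", combos[0])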
Randomised Search – Faster Tuning for Large Grids
Randomised Search samples a fixed number of random hyperparameter combinations rather than testing all of them. For large grids, this approach finds a near-optimal solution much faster than full Grid Search.
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform
# Define distributions to sample from
param_dist = {
"n_estimators": randint(50, 500),
"max_depth": [3, 5, 10, 15, 20, None],
"min_samples_split": randint(2, 20),
"min_samples_leaf": randint(1, 10),
"max_features": ["sqrt", "log2", None]
}
# Try 50 random combinations
rand_search = RandomizedSearchCV(
estimator=RandomForestClassifier(random_state=42),
param_distributions=param_dist,
n_iter=50, # Number of combinations to try
cv=5,
scoring="accuracy",
n_jobs=-1,
random_state=42,
verbose=1
)
rand_search.fit(X_scaled, y)
print("Best Parameters:", rand_search.best_params_)
print("Best CV Accuracy:", rand_search.best_score_.round(4))
Grid Search vs Randomised Search
| Aspect | Grid Search | Randomised Search |
|---|---|---|
| Method | Tests every combination | Samples N random combinations |
| Speed | Slow for large grids | Fast — controllable via n_iter |
| Guarantee | Finds global optimum within grid | May miss the optimum (but rarely by much) |
| Best used when | Small parameter grid (<50 combos) | Large parameter space, limited compute time |
Cross-Validated Evaluation of the Best Model
from sklearn.model_selection import cross_validate
from sklearn.metrics import make_scorer, f1_score, precision_score, recall_score
# Use best model found by GridSearchCV
best_model = grid_search.best_estimator_
# Evaluate with multiple metrics simultaneously
scoring_metrics = {
"accuracy": "accuracy",
"f1": make_scorer(f1_score, average="weighted"),
"precision": make_scorer(precision_score, average="weighted"),
"recall": make_scorer(recall_score, average="weighted")
}
cv_results = cross_validate(
best_model, X_scaled, y,
cv=5,
scoring=scoring_metrics
)
print("Final Model – 5-Fold CV Results:")
for metric, values in cv_results.items():
if metric.startswith("test_"):
name = metric.replace("test_", "")
print(f" {name:<12}: {values.mean():.4f} ± {values.std():.4f}")
Output:
Final Model – 5-Fold CV Results:
  accuracy    : 0.9736 ± 0.0082
  f1          : 0.9735 ± 0.0083
  precision   : 0.9737 ± 0.0082
  recall      : 0.9736 ± 0.0082
ROC Curve and AUC – For Binary Classification
The ROC (Receiver Operating Characteristic) curve plots True Positive Rate against False Positive Rate at every classification threshold. The AUC (Area Under the Curve) summarises this into a single number — a perfect model achieves AUC = 1.0.
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
best_model.fit(X_train, y_train)
y_proba = best_model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
auc_score = roc_auc_score(y_test, y_proba)
plt.figure(figsize=(7, 6))
plt.plot(fpr, tpr, color="steelblue", lw=2, label=f"ROC Curve (AUC = {auc_score:.4f})")
plt.plot([0,1],[0,1], "k--", lw=1, label="Random Classifier (AUC = 0.5)")
plt.title("ROC Curve – Random Forest (Best Model)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.tight_layout()
plt.savefig("roc_curve.png")
plt.show()
print(f"AUC Score: {auc_score:.4f}")
Diagram – Reading the ROC Curve
True Positive Rate (Recall)
1.0 │ ╭──────────────────
│ ╭──╯
│ ╭─╯ ← Good model (AUC ≈ 0.97)
0.5 │ ╭─╯
│ ╭─╯
│╭─╯ ← Random baseline (diagonal line)
0.0 └──────────────────────────────
0.0 0.5 1.0
False Positive Rate
AUC = 1.0 → Perfect model (top-left corner)
AUC = 0.5 → No better than random guessing (diagonal)
AUC = 0.0 → Perfectly wrong (bottom-right)
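These reference points can be checked directly with roc_auc_score. A quick sketch using the y_test split from the previous code block (random scores give approximately, not exactly, 0.5):
rng = np.random.default_rng(42)

print("Perfect scores :", roc_auc_score(y_test, y_test))                    # 1.0
print("Random scores  :", roc_auc_score(y_test, rng.random(len(y_test))))   # ≈ 0.5
print("Inverted scores:", roc_auc_score(y_test, 1 - y_test))                # 0.0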
Complete Model Evaluation Checklist
| Step | Action | Tool |
|---|---|---|
| 1 | Baseline evaluation | train_test_split + accuracy_score |
| 2 | Robust evaluation | cross_val_score (5-fold) |
| 3 | Diagnose bias/variance | learning_curve() |
| 4 | Tune hyperparameters | GridSearchCV or RandomizedSearchCV |
| 5 | Final evaluation | cross_validate with multiple metrics |
| 6 | Binary classification check | ROC curve + AUC score |
| 7 | Class-level detail | classification_report + confusion matrix |
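Step 7 is the only item not demonstrated earlier in this topic. A minimal sketch using the best model and the held-out test split from the ROC section:
from sklearn.metrics import classification_report, confusion_matrix

y_pred = best_model.predict(X_test)

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))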
Summary
- Evaluating a model on its training data produces falsely optimistic results — always use held-out data
- K-Fold Cross-Validation evaluates the model on multiple different splits and averages results for a stable estimate
- Stratified K-Fold preserves class distribution in each fold — essential for imbalanced datasets
- High bias means the model is too simple; high variance means it memorised training data
- Learning curves reveal whether a model is underfitting or overfitting at a glance
- Grid Search exhaustively tests all hyperparameter combinations — best for small grids
- Randomised Search samples random combinations — best for large parameter spaces
- The ROC curve and AUC score evaluate binary classifiers across all classification thresholds
