DS Model Evaluation and Hyperparameter Tuning

Building a machine learning model is only half the job. Evaluating it correctly and tuning it for peak performance are equally important. This topic covers cross-validation, evaluation metrics in depth, bias-variance tradeoff, hyperparameter tuning with Grid Search and Randomised Search, and learning curves — the complete toolkit for producing reliable, production-ready models.

Why Proper Evaluation Matters

A model evaluated on the same data it trained on will always appear more accurate than it truly is. Proper evaluation uses held-out data, multiple folds, and correct metrics to give an honest picture of how the model performs on real-world, unseen data.

The Evaluation Problem

Naive Approach (Wrong):
Train model on ALL data → Evaluate on SAME data → Reports 98% accuracy
↓
Real-world accuracy: 72% ← Model memorised training data

Correct Approach:
Training Data (80%) → Train model
Test Data (20%)     → Evaluate model (never used during training)
↓
Real-world accuracy: matches test accuracy ← Honest estimate
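
To see this in code, here is a minimal sketch of the correct approach (it reuses the breast-cancer dataset that the rest of this topic works with): hold out 20% of the data before training and score the model on both splits. The gap between the two printed numbers is exactly what the diagram above warns about.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)

# Hold out 20% that the model never sees during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print("Train accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("Test accuracy :", accuracy_score(y_test, model.predict(X_test)))
# Only the test accuracy is an honest estimate of real-world performance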

Cross-Validation – More Reliable Evaluation

A single train-test split depends heavily on which 20% of data lands in the test set. Cross-validation removes this randomness by training and evaluating the model on multiple different splits of the data and averaging the results.

K-Fold Cross-Validation

Dataset split into K=5 folds:
┌──────┬──────┬──────┬──────┬──────┐
│ F1   │ F2   │ F3   │ F4   │ F5   │
└──────┴──────┴──────┴──────┴──────┘

Round 1: [Train: F2,F3,F4,F5]  [Test: F1] → Score 1
Round 2: [Train: F1,F3,F4,F5]  [Test: F2] → Score 2
Round 3: [Train: F1,F2,F4,F5]  [Test: F3] → Score 3
Round 4: [Train: F1,F2,F3,F5]  [Test: F4] → Score 4
Round 5: [Train: F1,F2,F3,F4]  [Test: F5] → Score 5

Final CV Score = Mean of Score 1 to Score 5
                + Standard Deviation (measures stability)
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Load dataset
data     = load_breast_cancer()
X, y     = data.data, data.target

scaler   = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 5-Fold Cross-Validation
model = RandomForestClassifier(n_estimators=100, random_state=42)
cv    = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(model, X_scaled, y, cv=cv, scoring="accuracy")

print("CV Scores for each fold:", scores.round(4))
print(f"Mean Accuracy : {scores.mean():.4f}")
print(f"Std Deviation : {scores.std():.4f}")
print(f"95% CI        : {scores.mean():.4f} ± {2*scores.std():.4f}")

Output:

CV Scores for each fold: [0.9649 0.9561 0.9737 0.9649 0.9649]
Mean Accuracy : 0.9649
Std Deviation : 0.0057
95% CI        : 0.9649 ± 0.0114

→ Model consistently accurate (low std = stable model)

Stratified K-Fold – For Imbalanced Classes

Stratified K-Fold ensures each fold has the same proportion of each class as the full dataset. This matters when one class appears much more often than another — for example, 95% healthy vs 5% diseased.

# Stratified: preserves class distribution in each fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Check class distribution in one fold
for fold_num, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    y_fold = y[test_idx]
    pos_rate = y_fold.mean()
    print(f"Fold {fold_num+1}: Positive class = {pos_rate:.3f} ({(pos_rate*100):.1f}%)")

The Bias-Variance Tradeoff

Every machine learning model makes two kinds of reducible error: bias and variance (any remaining error is irreducible noise in the data itself). The tradeoff between them is the central challenge in model selection and tuning.

Definitions

Bias   = Error from wrong assumptions in the model.
         A high-bias model is too simple — it consistently misses the true pattern.
         (Example: fitting a straight line to curved data)

Variance = Error from sensitivity to small fluctuations in training data.
           A high-variance model is too complex — it memorises noise.
           (Example: a deep decision tree that memorises every training point)

Diagram – Bias-Variance Tradeoff

Error
     │  \
     │   \   ← Bias (falls as complexity grows)
     │    \        ╱
     │     \      ╱   ← Variance (rises as complexity grows)
     │      \    ╱
     │       \  ╱
     │        \╱   ← Sweet spot (lowest Total Error)
     │        ╱\
     │       ╱  \
     │─────────────────────────→ Model Complexity
       ↑                     ↑
    High Bias           High Variance
  (Underfitting)        (Overfitting)

Bias decreases as complexity grows.
Variance increases as complexity grows.
Best model sits at the balance point.
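
To make the tradeoff concrete, the short sketch below (an illustration, not part of the tuning workflow that follows) varies a decision tree's max_depth on the same dataset and compares training accuracy with cross-validated accuracy:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

for depth in [1, 3, 10, None]:          # from very simple to fully grown
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    res  = cross_validate(tree, X, y, cv=5, return_train_score=True)
    print(f"max_depth={str(depth):>4}  "
          f"train={res['train_score'].mean():.3f}  "
          f"cv={res['test_score'].mean():.3f}")

# Expected pattern: max_depth=1 scores low on both sets (high bias);
# max_depth=None scores near 1.0 on training but shows a larger gap to
# the CV score (high variance); a middle depth balances the two.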

Learning Curves – Diagnosing Model Problems

A learning curve plots training and validation accuracy against the number of training samples. The shape of the curve reveals whether the model suffers from high bias, high variance, or neither.

import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

def plot_learning_curve(model, X, y, title):
    train_sizes, train_scores, val_scores = learning_curve(
        model, X, y,
        cv=5,
        train_sizes=np.linspace(0.1, 1.0, 10),
        scoring="accuracy",
        shuffle=True,          # shuffle so random_state takes effect
        random_state=42
    )

    train_mean = train_scores.mean(axis=1)
    train_std  = train_scores.std(axis=1)
    val_mean   = val_scores.mean(axis=1)
    val_std    = val_scores.std(axis=1)

    plt.figure(figsize=(8, 5))
    plt.plot(train_sizes, train_mean, "o-", color="steelblue", label="Training Accuracy")
    plt.plot(train_sizes, val_mean,   "o-", color="tomato",    label="Validation Accuracy")
    plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std,
                     alpha=0.1, color="steelblue")
    plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std,
                     alpha=0.1, color="tomato")
    plt.title(f"Learning Curve – {title}")
    plt.xlabel("Training Set Size")
    plt.ylabel("Accuracy")
    plt.legend()
    plt.tight_layout()
    plt.savefig(f"learning_curve_{title.replace(' ','_')}.png")
    plt.show()

plot_learning_curve(model, X_scaled, y, "Random Forest")

Diagram – Interpreting Learning Curves

High Bias (Underfitting):         High Variance (Overfitting):
Accuracy                          Accuracy
  │ ─ ─ ─ ─ ─ ─ ─ 1.0            │ ─────────── Train
  │  Train  ≈ Val                 │
  │ ──────────────                │            Val ──────────
  │ ─ ─ ─ ─ ─ ─ ─ 0.7            │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
  └──────────────→                └──────────────→
  Both lines low and close.       Large gap between Train and Val.
  Model too simple.               Model memorised training data.

Well-Fitted Model:
Accuracy
  │ ─────────────────────── Train (high)
  │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ Val (close to Train)
  └──────────────→
  Both lines high, gap is small.
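
The same diagnosis can be made numerically. The rough sketch below reuses model, X_scaled and y from the earlier code and compares the final training and validation scores returned by learning_curve; the 0.85 and 0.05 cut-offs are illustrative choices, not fixed rules.

train_sizes, train_scores, val_scores = learning_curve(
    model, X_scaled, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 10)
)

final_train = train_scores.mean(axis=1)[-1]
final_val   = val_scores.mean(axis=1)[-1]
gap         = final_train - final_val

print(f"Final train accuracy: {final_train:.3f}")
print(f"Final val accuracy  : {final_val:.3f}")
print(f"Gap (train - val)   : {gap:.3f}")

if final_val < 0.85:        # both scores low -> likely high bias
    print("Likely underfitting (high bias)")
elif gap > 0.05:            # large train/val gap -> likely high variance
    print("Likely overfitting (high variance)")
else:
    print("Reasonable fit")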

Hyperparameter Tuning

Hyperparameters are settings configured before training begins — they control how the algorithm learns. Tuning these settings to find the optimal combination for a specific dataset is called hyperparameter tuning.

Common Hyperparameters by Algorithm

Algorithm            Key Hyperparameters                          Effect
-------------------  -------------------------------------------  -----------------------------------------------------------
Random Forest        n_estimators, max_depth, min_samples_split   Controls tree count, depth, and minimum split size
SVM                  C, kernel, gamma                              Controls margin width, shape of boundary, kernel influence
KNN                  n_neighbors, metric                           Controls number of neighbours and distance formula
Logistic Regression  C, penalty, solver                            Controls regularisation strength and type
Gradient Boosting    n_estimators, learning_rate, max_depth        Controls boosting rounds, step size, and tree complexity
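
In scikit-learn these settings live in the estimator's constructor: they are fixed before .fit() is called, and get_params() lists the current values. A quick sketch (the numbers below are arbitrary example values, not recommendations):

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=200,        # hyperparameter: number of trees
    max_depth=10,            # hyperparameter: maximum depth of each tree
    min_samples_split=5,     # hyperparameter: minimum samples needed to split a node
    random_state=42
)

print(rf.get_params())       # shows every hyperparameter and its current value
# rf.fit(X_scaled, y) would then learn the model's parameters (the fitted trees)
# for this particular combination of hyperparameters.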

Grid Search – Exhaustive Tuning

Grid Search tries every possible combination of specified hyperparameter values. It trains and evaluates the model for each combination using cross-validation and reports the best-performing combination.

from sklearn.model_selection import GridSearchCV

# Define the hyperparameter grid
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth":    [3, 5, 10, None],
    "min_samples_split": [2, 5, 10]
}

# Grid search with 5-fold CV
grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1,        # Use all CPU cores
    verbose=1
)

grid_search.fit(X_scaled, y)

print("Best Parameters:", grid_search.best_params_)
print("Best CV Accuracy:", grid_search.best_score_.round(4))

Output:

Best Parameters: {'max_depth': 10, 'min_samples_split': 2, 'n_estimators': 200}
Best CV Accuracy: 0.9736

Diagram – Grid Search Combinations

n_estimators:      [50,   100,  200]
max_depth:         [3,    5,    10, None]
min_samples_split: [2,    5,    10]

Total combinations: 3 × 4 × 3 = 36 combinations
Each tested with 5-fold CV → 36 × 5 = 180 model fits

Grid visualised (n_estimators vs max_depth, fixing min_split=2):
              max_depth
             3     5    10   None
n_est  50  [0.94 0.96 0.97 0.95]
      100  [0.95 0.97 0.97 0.96]
      200  [0.95 0.97 0.97 0.96]  ← best cell here
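
After fitting, every tested combination is stored in grid_search.cv_results_. As an optional follow-up (assuming the grid_search object from the code above), the snippet below loads it into a DataFrame and shows the five best combinations:

results = pd.DataFrame(grid_search.cv_results_)

top5 = results.sort_values("rank_test_score")[
    ["params", "mean_test_score", "std_test_score"]
].head()

print(top5.to_string(index=False))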

Randomised Search – Faster Tuning for Large Grids

Randomised Search samples a fixed number of random hyperparameter combinations rather than testing all of them. For large grids, this approach finds a near-optimal solution much faster than full Grid Search.

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

# Define distributions to sample from
param_dist = {
    "n_estimators":      randint(50, 500),
    "max_depth":         [3, 5, 10, 15, 20, None],
    "min_samples_split": randint(2, 20),
    "min_samples_leaf":  randint(1, 10),
    "max_features":      ["sqrt", "log2", None]
}

# Try 50 random combinations
rand_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=50,          # Number of combinations to try
    cv=5,
    scoring="accuracy",
    n_jobs=-1,
    random_state=42,
    verbose=1
)

rand_search.fit(X_scaled, y)

print("Best Parameters:", rand_search.best_params_)
print("Best CV Accuracy:", rand_search.best_score_.round(4))

Grid Search vs Randomised Search

Aspect          Grid Search                         Randomised Search
--------------  ----------------------------------  -------------------------------------------
Method          Tests every combination             Samples N random combinations
Speed           Slow for large grids                Fast — controllable via n_iter
Guarantee       Finds global optimum within grid    May miss the optimum (but rarely by much)
Best used when  Small parameter grid (<50 combos)   Large parameter space, limited compute time

Cross-Validated Evaluation of the Best Model

from sklearn.model_selection import cross_validate
from sklearn.metrics import make_scorer, f1_score, precision_score, recall_score

# Use best model found by GridSearchCV
best_model = grid_search.best_estimator_

# Evaluate with multiple metrics simultaneously
scoring_metrics = {
    "accuracy":  "accuracy",
    "f1":        make_scorer(f1_score, average="weighted"),
    "precision": make_scorer(precision_score, average="weighted"),
    "recall":    make_scorer(recall_score, average="weighted")
}

cv_results = cross_validate(
    best_model, X_scaled, y,
    cv=5,
    scoring=scoring_metrics
)

print("Final Model – 5-Fold CV Results:")
for metric, values in cv_results.items():
    if metric.startswith("test_"):
        name = metric.replace("test_", "")
        print(f"  {name:<12}: {values.mean():.4f} ± {values.std():.4f}")

Output:

Final Model – 5-Fold CV Results:
  accuracy    : 0.9736 ± 0.0082
  f1          : 0.9735 ± 0.0083
  precision   : 0.9737 ± 0.0082
  recall      : 0.9736 ± 0.0082

ROC Curve and AUC – For Binary Classification

The ROC (Receiver Operating Characteristic) curve plots True Positive Rate against False Positive Rate at every classification threshold. The AUC (Area Under the Curve) summarises this into a single number — a perfect model achieves AUC = 1.0.

from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

best_model.fit(X_train, y_train)
y_proba = best_model.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, y_proba)
auc_score = roc_auc_score(y_test, y_proba)

plt.figure(figsize=(7, 6))
plt.plot(fpr, tpr, color="steelblue", lw=2, label=f"ROC Curve (AUC = {auc_score:.4f})")
plt.plot([0,1],[0,1], "k--", lw=1, label="Random Classifier (AUC = 0.5)")
plt.title("ROC Curve – Random Forest (Best Model)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.tight_layout()
plt.savefig("roc_curve.png")
plt.show()

print(f"AUC Score: {auc_score:.4f}")

Diagram – Reading the ROC Curve

True Positive Rate (Recall)
  1.0 │           ╭──────────────────
      │        ╭──╯
      │      ╭─╯   ← Good model (AUC ≈ 0.97)
  0.5 │    ╭─╯
      │  ╭─╯
      │╭─╯ ← Random baseline (diagonal line)
  0.0 └──────────────────────────────
      0.0      0.5                1.0
           False Positive Rate

AUC = 1.0 → Perfect model (top-left corner)
AUC = 0.5 → No better than random guessing (diagonal)
AUC = 0.0 → Perfectly wrong (bottom-right)
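
The ROC curve can also guide the choice of classification threshold. As an aside beyond the workflow above, the sketch below (reusing fpr, tpr, thresholds and y_proba from the ROC code) picks the threshold that maximises Youden's J statistic, one common heuristic:

# Youden's J statistic = TPR - FPR; its maximum marks a balanced threshold
j_scores       = tpr - fpr
best_idx       = np.argmax(j_scores)
best_threshold = thresholds[best_idx]

print(f"Threshold maximising Youden's J: {best_threshold:.3f}")
print(f"At this threshold: TPR = {tpr[best_idx]:.3f}, FPR = {fpr[best_idx]:.3f}")

# Apply the chosen threshold instead of the default 0.5
y_pred_tuned = (y_proba >= best_threshold).astype(int)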

Complete Model Evaluation Checklist

Step  Action                       Tool
----  ---------------------------  -----------------------------------------
1     Baseline evaluation          train_test_split + accuracy_score
2     Robust evaluation            cross_val_score (5-fold)
3     Diagnose bias/variance       learning_curve()
4     Tune hyperparameters         GridSearchCV or RandomizedSearchCV
5     Final evaluation             cross_validate with multiple metrics
6     Binary classification check  ROC curve + AUC score
7     Class-level detail           classification_report + confusion matrix

Summary

  • Evaluating a model on its training data produces falsely optimistic results — always use held-out data
  • K-Fold Cross-Validation evaluates the model on multiple different splits and averages results for a stable estimate
  • Stratified K-Fold preserves class distribution in each fold — essential for imbalanced datasets
  • High bias means the model is too simple; high variance means it memorised training data
  • Learning curves reveal whether a model is underfitting or overfitting at a glance
  • Grid Search exhaustively tests all hyperparameter combinations — best for small grids
  • Randomised Search samples random combinations — best for large parameter spaces
  • The ROC curve and AUC score evaluate binary classifiers across all classification thresholds
